Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 8, 2013

JQVMAP

Filed under: Geographic Data,Geography,GIS,JQuery — Patrick Durusau @ 3:47 pm

JQVMAP

From the webpage:

JQVMap is a jQuery plugin that renders Vector Maps. It uses resizable Scalable Vector Graphics (SVG) for modern browsers like Firefox, Safari, Chrome, Opera and Internet Explorer 9. Legacy support for older versions of Internet Explorer 6-8 is provided via VML.

Whatever your source of data (cellphone location data, user observations, etc.), rendering it to a geographic display may be useful.

Abuse of Classification

Filed under: Cybersecurity,Government,Security — Patrick Durusau @ 3:37 pm

Obama orders US to draw up overseas target list for cyber-attacks by Glenn Greenwald.

From the post:

Barack Obama has ordered his senior national security and intelligence officials to draw up a list of potential overseas targets for US cyber-attacks, a top secret presidential directive obtained by the Guardian reveals.

The 18-page Presidential Policy Directive 20, issued in October last year but never published, states that what it calls Offensive Cyber Effects Operations (OCEO) “can offer unique and unconventional capabilities to advance US national objectives around the world with little or no warning to the adversary or target and with potential effects ranging from subtle to severely damaging”.

It says the government will “identify potential targets of national importance where OCEO can offer a favorable balance of effectiveness and risk as compared with other instruments of national power”.

The directive also contemplates the possible use of cyber actions inside the US, though it specifies that no such domestic operations can be conducted without the prior order of the president, except in cases of emergency.

I can’t post the document here so you will have to go there to read it.

Some things to keep in mind while you read the directive.

First, these are the guys who have extensive military contracts with vendors who are routinely hacked by script kiddies. And then they try to blame China because their contractors fail to take basic computer security seriously.

It’s like leaving your Porsche running while you go in for a haircut. Who could have possibly prevented the car theft? Think real hard; you have ten (10) seconds.

Second, it looks like abuse of classification to me. Here’s the test: give the URL to 10 people in your office and ask them all to summarize the same paragraph from the document. Not more than a 20-word summary.

Check the results. This document would be safe alongside the SSN printed on the LifeLock trailer. I won’t say it means nothing at all, but it certainly contains nothing worth protecting.

Third, if a case can be made for classifying this document, is it on the basis of abuse of language?

That is, is it so poorly drafted that, if others mimic its style, the semantic content of government documents will be degraded generally?

Are You Near Me?

Filed under: Geographic Data,Georeferencing,GIS,Lucene — Patrick Durusau @ 1:52 pm

Lucene 4.X is a great tool for analyzing cellphone location data (Did you really think only the NSA has it?).

Chilamakuru Vishnu gets us started with a code-heavy post with the promise of:

My Next Blog Post will talk about how to implement advanced spatial queries like

geoIntersecting – where one polygon intersects with another polygon/line.

geoWithIn – where one polygon lies completely within another polygon.

Or you could obtain geolocation data by other means.
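If you want to see what those two predicates mean before wading into the Lucene code, here is a rough Python sketch using the shapely library (my substitution, not anything from Vishnu’s post; the shapes and coordinates are invented):

```python
# A rough sketch of the two spatial predicates named above, using shapely
# rather than Lucene. Shapes and coordinates are invented for illustration.
from shapely.geometry import LineString, Polygon

cell_coverage = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
neighborhood = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])
highway = LineString([(-1, 2), (5, 2)])

# geoIntersecting: does one polygon intersect with another polygon/line?
print(cell_coverage.intersects(highway))       # True
print(neighborhood.intersects(cell_coverage))  # True

# geoWithin: does one polygon lie completely within another polygon?
print(neighborhood.within(cell_coverage))      # True
print(cell_coverage.within(neighborhood))      # False
```

Lucene’s spatial support runs the same kinds of tests against an index at scale; the sketch is only meant to pin down the semantics.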

I first saw this at DZone.

Bradley Manning Trial Transcript (Funding Request)

Filed under: Law - Sources,Legal Informatics,Security — Patrick Durusau @ 1:22 pm

No, not for me.

Funding to support the creation of a public transcript of Bradley Manning’s trial.

Brian Merchant reports in: The Only Public Transcript of the Bradley Manning Trial Will Be Tapped Out on a Crowd-Funded Typewriter:

The Bradley Manning trial began this week, and it is being held largely in secret—according to the Freedom of the Press Foundation, 270 of the 350 media organizations that applied for access were denied. Major outlets like Reuters, the AP, and the Guardian, were forced to sign a document stating they would withhold certain information in exchange for the privilege of attending.

Oh, and no video or audio recorders allowed. And no official transcripts will be made available to anyone.  

But, the court evidently couldn't find grounds to boot out FPF's crowd-funded stenographers, who will be providing the only publicly available transcripts of the trial. (You can donate to the effort and read the transcripts here.)

Which is good news for journalists and anyone looking for an accurate—and public—record of the trial. But the fact that a volunteer stenographer is providing the only comprehensive source of information about such a monumental event is pretty absurd. 

The disclaimer that precedes each transcript epitomizes said absurdity. It reads: "This transcript was made by a court reporter who … was not permitted to be in the actual courtroom where the proceedings took place, but in a media room listening to and watching live audio/video feed, not permitted to make an audio backup recording for editing purposes, and not having the ability to control the proceedings in order to produce an accurate verbatim transcript."

In other words, it's a lone court reporter, frantically trying to tap out all the details down, technologically unaided, sequestered in a separate room, in one uninterrupted marathon session. And this will be the definitive record of the trial for public consumption. What's the logic behind this, now? Why allow an outside stenographer but not an audio recorder? Does the court just assume that no one will pay attention to the typed product? Or are they hoping to point to the reporter's fallibility in the instance that something embarrassing to the state is revealed? 

In case you missed it: Donate HERE to support public transcripts of the Bradley Manning trial.

Please donate and repost, reblog, tweet, email, etc., the support URL wherever possible.

Whatever the source of the Afghan War Diaries, they are proof that government secrecy is used to hide petty incompetence.

A transcript of the Bradley Manning trial will show government embarrassment, not national security, lies at the core of this trial.

I first saw this at Nat Torkington’s Four short links: 6 June 2013.

Big Data Open Source Tools

Filed under: BigData,Open Source — Patrick Durusau @ 9:33 am

Open Source Tools

If you are looking for a big data open source project or just want to illustrate the depth of open source software for big data, this is the graphic for you!

Categories:

  • Big Data Search
  • Business Intelligence
  • Data aggregation
  • Data Analysis & Platforms
  • Databases / Data warehousing
  • Data Mining
  • Document Store
  • Graphs
  • Grid Solutions
  • KeyValue
  • Multidimensional
  • Multimodel
  • Multivalue database
  • Object databases
  • Operational
  • Social
  • XML Databases

The basis for a trivia game at a conference? The moderator pulls the name of a software project out of a hat and you have ten seconds to name three technical facts about the software?

Could be really amusing. Not quite the Newlywed Game, but still amusing.

You can download it as a PDF.

June 7, 2013

NSA…Verizon…Obama…
Connecting the Dots. Or not.

Why Verizon?

That was the first question that came to mind when the Guardian broke the NSA-Verizon news.

Here’s why I ask:

[Chart: Verizon market share]

(source: http://www.statista.com/statistics/199359/market-share-of-wireless-carriers-in-the-us-by-subscriptions/)

Over 2011-2012, Verizon had only 34% of the cell phone market.

Unless terrorists prefer Verizon for ideological reasons, why Verizon?

Choosing only Verizon means the NSA is missing 66% of potential terrorist cell traffic.

That sounds like a bad plan.

What other reason could there be for picking Verizon?

Consider some other known players:

President Barack Obama, candidate for President of the United States, 2012.

“Bundlers” who gathered donations for Barack Obama:

Min      | Max      | Name           | City          | State | Employer
$200,000 | $500,000 | Hill, David    | Silver Spring | MD    | Verizon Communications
$200,000 | $500,000 | Brown, Kathryn | Oakton        | VA    | Verizon Communications
$50,000  | $100,000 | Milch, Randal  | Bethesda      | MD    | Verizon Communications

(Source: OpenSecrets.org – 2012 Presidential – Bundlers)

BTW, the Max category means more money may have been given, but that is the top reporting category.

I have informally “identified” the bundlers as follows:

  • Kathryn C. Brown

    Kathryn C. Brown is senior vice president – Public Policy Development and Corporate Responsibility. She has been with the company since June 2002. She is responsible for policy development and issues management, public policy messaging, strategic alliances and public affairs programs, including Verizon Reads.

    Ms. Brown is also responsible for federal, state and international public policy development and international government relations for Verizon. In that role she develops public policy positions and is responsible for project management on emerging domestic and international issues. She also manages relations with think tanks as well as consumer, industry and trade groups important to the public policy process.

  • David A. Hill, Bloomberg Business Week reports: David A. Hill serves as Director of Verizon Maryland Inc.

    LinkedIn profile reports David A. Hill worked for Verizon, VP & General Counsel (2000 – 2006), Associate General Counsel (March 2006 – 2009), Vice President & Associate General Counsel (March 2009 – September 2011) “Served as a liaison between Verizon and the Obama Administration”

  • Randal S. Milch Executive Vice President – Public Policy and General Counsel

What is Verizon making for each data delivery? Is this cash for cash given?

If someone gave you more than $1 million (how much more is unknown), would you talk to them about such a court order?

If you read the “secret” court order, you will notice it was signed on April 23, 2013.

There isn’t a Kathryn C. Brown in Oakton in the White House visitors’ log, but I did find this record, where a “Kathryn C. Brown” made an appointment at the White House and was seen two (2) days later, on the 17th of January 2013.

BROWN,KATHRYN,C,U69535,,VA,,,,,1/15/13 0:00,1/17/13 9:30,1/17/13 23:59,,176,CM,WIN,1/15/13 11:27,CM,,POTUS/FLOTUS,WH,State Floo,MCNAMARALAWDER,CLAUDIA,,,04/26/2013

I don’t have all the dots connected, because I am lacking an unknown number of the players, internal Verizon communications, and Verizon accounting records showing government payments, but it is enough to make you wonder about the purpose of the “secret” court order.

Was it a serious attempt at gathering data for national security reasons?

Or was it gathering data as a pretext for payments to Verizon or other contractors?

My vote goes for “pretext for payments.”

I say that because using data from different sources has always been hard.

In fact, about 60 to 80% of the time of a data analyst is spent “cleaning up data” for further processing.

The phrase “cleaning up data” is the colloquial form of “semantic impedance.”

Semantic impedance happens when the same people are known by different names in different data sets or different people are known by the same names in the same or different data sets.
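A toy Python sketch, with invented records, shows how quickly exact matching runs into both halves of that problem:

```python
# Invented records: none of this is real data.
fec_record   = {"name": "Brown, Kathryn",  "city": "Oakton",  "state": "VA"}
visitor_log  = {"name": "BROWN,KATHRYN,C", "badge": "U69535", "state": "VA"}
facebook_rec = {"name": "Kathryn Brown",   "location": "Bangkok, Thailand"}

def exact_match(a, b):
    # Exact string comparison is the naive baseline.
    return a["name"] == b["name"]

# Same person, different names: the comparison fails.
print(exact_match(fec_record, visitor_log))     # False
# Same name, possibly different people: the comparison "succeeds."
print(facebook_rec["name"] == "Kathryn Brown")  # True
```

Everything beyond that naive comparison is where the analyst’s 60 to 80% goes.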

Remember Kathryn Brown, of Oakton, VA? One of the Obama bundlers. Let’s use her as an example of “semantic impedance.”

The FEC has a record for Kathryn Brown of Oakton, VA.

But a search engine found:

Kathryn C. Brown

Same person? Or different?

I found another Kathryn Brown at Facebook:

And an image of Facebook Kathryn Brown:

[Image: Kathryn Brown, Facebook]

And a photo from a vacation she took:

[Image: Bangkok vacation photo]

Not to mention the Kathryn Brown that I found at Twitter.

[Image: Kathryn Brown, Twitter]

That’s only four (4) data sources and I have at least four (4) different Kathryn Browns.

Across the United States, a quick search shows 227,000 Kathryn Browns.

Remember that is just a personal name. What about different forms of addresses? Or names of employers? Or job descriptions? Or simple errors, like the 20% error rate in credit report records.

Take all the phones, plus names, addresses, employers, job descriptions, errors and other data, and multiply that by 311.6 million Americans.

Can that problem be solved with petabytes of data and teraflops of processing?

Not a chance.

Remember that my identification of Kathryn “bundler” Brown with the Kathryn C. Brown of Verizon was a human judgement, not an automatic rule. Nor would a computer think to check the White House visitor logs to see if another, possibly the same, Kathryn C. Brown visited the White House before the secret order was signed.

Human judgement is required because all the data that the NSA has been collecting is “dirty” data, from one perspective or another. Either it is truly “dirty” in the sense of having errors or it is “dirty” in the sense that it doesn’t play well with other data.

The Orwellian fearists can stop huffing and puffing about the coming eclipse of civil liberties. Those passed from view a short time after 9/11 with the passage of the Patriot Act.

That wasn’t the fault of ineffectual NSA data collection. American voters bear responsibility for the loss of civil liberties by not voting leadership into office that would repeal the Patriot Act.

Ineffectual NSA data collection impedes the development of techniques that, for a sanely scoped data collection effort, could make a difference.

A sane scope for preventing terrorist attacks could be to start with a set of known or suspected terrorist phone numbers. Using all phone data (not just data from Obama contributors), only numbers contacting or being contacted by those numbers would be subject to further analysis.

Using that much smaller set of phone numbers as identifiers, we could then collect other data, such as names and addresses associated with that smaller set of phone numbers. That doesn’t make the data any cleaner but it does give us a starting point for mapping “dirty” data sets into our starter set.

The next step would be to create mappings from other data sets. If we say why we have created a mapping, others can evaluate the accuracy of our mappings.
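A minimal Python sketch of that workflow, with invented call records and one illustrative mapping (nothing here comes from real data):

```python
from collections import defaultdict

# Invented call detail records: (caller, callee) pairs only.
calls = [("555-0100", "555-0199"), ("555-0199", "555-0142"),
         ("555-0123", "555-0456"), ("555-0100", "555-0142")]

seeds = {"555-0100"}  # hypothetical known or suspected numbers

# Step 1: keep only numbers contacting, or contacted by, a seed number.
contacts = defaultdict(set)
for a, b in calls:
    contacts[a].add(b)
    contacts[b].add(a)
candidates = set(seeds)
for s in seeds:
    candidates |= contacts[s]
print(sorted(candidates))

# Step 2: map other data sets onto the smaller set, recording *why*
# each mapping was made so that others can evaluate it.
mappings = [{"number": "555-0199",
             "mapped_to": "subscriber record A-17 (hypothetical)",
             "reason": "billing address matches the number's registration address"}]
for m in mappings:
    print(m["number"], "->", m["mapped_to"], "because", m["reason"])
```

The reason field is the point of the paragraph above: a mapping that someone else can evaluate.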

Those tasks would require computer assistance, but they ultimately would be matters of human judgement.

Examples of such judgements exist, in the Palantir product line for example. If you watch Palantir Gotham being used to model biological relationships, take note of the results that were tagged by another analyst, and how the presenter tags additional material that becomes available to other researchers.

Computer assisted? Yes. Computer driven? No.

To be fair, human judgement is also involved in ineffectual NSA data collection efforts.

But it is human judgement that rewards sycophants and supporters, not serving the public interest.

An NSA Big Graph Experiment
[On Non-Real World Data]

Filed under: Accumulo,Graphs — Patrick Durusau @ 9:41 am

An NSA Big Graph Experiment by Paul Burkhardt and Chris Waring.

Slide presentation on processing graphs with Apache Accumulo.

Which has some impressive numbers:

[Image: Graph500 benchmark numbers]

Except that if you review the Graph 500 Benchmark Specification,

N the total number of vertices, 2SCALE. An implementation may use any set of N distinct integers to number the vertices, but at least 48 bits must be allocated per vertex number. Other parameters may be assumed to fit within the natural word of the machine. N is derived from the problem’s scaling parameter.

You find that all the nodes are normalized (no duplicates).

Moreover, the Graph 500 Benchmark cites:

The graph generator is a Kronecker generator similar to the Recursive MATrix (R-MAT) scale-free graph generation algorithm [Chakrabarti, et al., 2004].

Which provides:

There is a subtle point here: we may have duplicate edges (i.e., edges which fall into the same cell in the adjacency matrix), but we only keep one of them. (R-MAT: A Recursive Model for Graph Mining by Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos)

By design, the Graph 500 Benchmark operates on completely normalized graphs.
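To make the normalization point concrete, here is a rough Python sketch of R-MAT-style edge generation (simplified, with illustrative parameters; not the Graph 500 reference code):

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19):
    # Recursively pick a quadrant of the adjacency matrix; d = 1 - a - b - c.
    src = dst = 0
    for _ in range(scale):
        src, dst = src * 2, dst * 2
        r = random.random()
        if r < a:
            pass                          # top-left quadrant
        elif r < a + b:
            dst += 1                      # top-right
        elif r < a + b + c:
            src += 1                      # bottom-left
        else:
            src, dst = src + 1, dst + 1   # bottom-right
    return src, dst

random.seed(1)
edges = [rmat_edge(scale=8) for _ in range(2000)]   # 2**8 = 256 vertices
print(len(edges), "edges generated,", len(set(edges)), "after deduplication")
```

The benchmark keeps only the deduplicated set. Real-world graphs arrive with the duplicates, and worse, with duplicates that do not even share the same identifiers.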

I mention that because the graphs from Verizon, credit bureaus, FaceBook, Twitter, etc. are anything but normalized: some are not normalized even internally, and none are normalized with respect to each other.

Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy is a chilling outline of semantic impedance in data within a single organization. Semantic impedance that would be reflected in graph processing of that data.

How much more semantic impedance will be encountered when graphs are built from diverse data sources?

Bottom line: The NSA gets great performance from Accumulo on normalized graphs, graphs that do not reflect real-world, non-normalized data.

I first saw this NSA presentation at Here’s how the NSA analyzes all that call data by Derrick Harris.

June 6, 2013

edX Code

Filed under: Education,Python — Patrick Durusau @ 2:28 pm

edX Code

From the homepage:

Welcome to edX Code, where developers around the globe are working to create a next-generation online learning platform that will bring quality education to students around the world.

EdX is a not-for-profit enterprise composed of 27 leading global institutions, the xConsortium. Since our founding in May 2012, edX has been committed to an open source vision. We believe in pursuing non-profit, open-source opportunities for expanding online education around the world. We believe it’s important to support these efforts in visible and substantive ways, and that’s why we are opening up our platform and inviting the world to help us make it better.

If you think topic maps are relevant to education, then they should be relevant to online education.

Yes?

A standalone topic map application is not needed in this context, but I don’t recall any standalone application requirement.

I first saw this at: edX learning platform now all open source.

HTML 5.1 (new draft)

Filed under: HTML5,W3C — Patrick Durusau @ 2:18 pm

HTML 5.1 (new draft)

Abstract:

This specification defines the 5th major version, first minor revision of the core language of the World Wide Web: the Hypertext Markup Language (HTML). In this version, new features continue to be introduced to help Web application authors, new elements continue to be introduced based on research into prevailing authoring practices, and special attention continues to be given to defining clear conformance criteria for user agents in an effort to improve interoperability.

Important to watch as browsers become the dominant means of content delivery.

I first saw this at: New HTML 5.1 working draft released.

Population Growth and Climate Change (Hans Rosling)

Filed under: Graphics,Visualization — Patrick Durusau @ 2:12 pm

A good dose of Hans Rosling always inspires me to use clearer explanations.

Note: “Inspires,” I didn’t say I achieve clearer explanations.

I first saw this at Making Data tell a Story.

Search is Not a Solved Problem

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 1:58 pm

From the description:

The brief idea behind this talk is that search is not a solved problem — there is still a big opportunity for building search (and finding?) capabilities for the kinds of questions that the current product fail to solve. For example, why do search engines just return a list of sorted URLs, but give me no information about the themes that are consistent across them?

Hmmm, “…themes that are consistent across them?”

Do you think she means subjects across URLs?

😉

Important point: What people post isn’t the same content that they consume!

Who’s Afraid of the NSA?

Filed under: Cypher,Government,Privacy,Security — Patrick Durusau @ 1:17 pm

To Catch a Cyber-Thief

From the post:

When local police came calling with child porn allegations last January, former Saint John city councillor Donnie Snook fled his house clutching a laptop. It was clear that the computer contained damning data. Six months later, police have finally gathered enough evidence to land him in jail for a long time to come.

With a case seemingly so cut and dry, why the lag time? Couldn’t the police do a simple search for the incriminating info and level charges ASAP? Easier said than done. With computing devices storing terrabytes of personal data, it can take months before enough evidence can be cobbled together from reams of documents, emails, chat logs and text messages.

That’s all about to change thanks to a new technique developed by researchers at Concordia University, who have slashed the data-crunching time. What once took months now takes minutes.

Gaby Dagher and Benjamin Fung, researchers with the Concordia Institute for Information Systems Engineering, will soon publish their findings in Data & Knowledge Engineering. Law enforcement officers are already putting this research to work through Concordia’s partnership with Canada’s National Cyber-Forensics and Training Alliance, in which law enforcement organizations, private companies, and academic institutions work together to share information to stop emerging cyber threats and mitigate existing ones.

Thanks to Dagher and Fung, crime investigators can now extract hidden knowledge from a large volume of text. The researchers’ new methods automatically identify the criminal topics discussed in the textual conversation, show which participants are most active with respect to the identified criminal topics, and then provide a visualization of the social networks among the participants.

Dagher, who is a PhD candidate supervised by Fung, explains “the huge increase in cybercrimes over the past decade boosted demand for special forensic tools that let investigators look for evidence on a suspect’s computer by analyzing stored text. Our new technique allows an investigator to cluster documents by producing overlapping groups, each corresponding to a specific subject defined by the investigator.”

Have you heard about clustering documents? Searching large volumes of text? Producing visualizations of social networks?

The threat of government snooping on its citizens should be evaluated on its demonstrated competence.

The FBI wants special backdoors (like it has for telecommunications) just to monitor IP traffic. (Going Bright… [Hack Shopping Mall?])

It would help the FBI if they had our secret PGP keys.

There’s a thought: maybe we should all generate new PGP keys and send the secret key for each new key to the FBI.

They may not ever intercept any traffic encrypted with those keys but they can get funding from Congress to maintain an archive of them and to run them against all IP traffic.

The NSA probably has better chops when it comes to data collection but identity mining?

Identity mining is something completely different.

(See: The NSA Verizon Collection Coming on DVD)

The NSA Verizon Collection Coming on DVD

Filed under: Cybersecurity,Government,Privacy,Security — Patrick Durusau @ 1:15 pm

Don’t you wish! 😉

Sadly, U.S. citizens have to rely on the foreign press (NSA collecting phone records of millions of Verizon customers daily) for even minimal transparency of our own government.

According to the post:

Under the terms of the blanket order, the numbers of both parties on a call are handed over, as is location data, call duration, unique identifiers, and the time and duration of all calls. The contents of the conversation itself are not covered.

The order expires July 19, 2013. One response is to get a Verizon account and set up a war games dialer to make 1-800 calls between now and July 19, 2013.

The other response is to think about the subject identity management issues with the Verizon data.

Bare Verizon Data

Let’s see, you get: “numbers of both parties on a call, location data, call duration, unique identifiers, and the time and duration of all calls.”

Not all that difficult to create networks of the calls based on the Verizon data but that doesn’t get you identity of the people making the calls.
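A short sketch with invented records makes both halves of that sentence obvious:

```python
from collections import defaultdict

# Invented call metadata: (caller, callee, duration_seconds, start_time).
records = [("555-0100", "555-0199", 320, "2013-04-25T09:12"),
           ("555-0199", "555-0142",  45, "2013-04-25T09:40"),
           ("555-0100", "555-0142", 600, "2013-04-26T18:05")]

# Build a weighted call network: edge weight = total talk time.
graph = defaultdict(lambda: defaultdict(int))
for caller, callee, duration, _ in records:
    graph[caller][callee] += duration

for caller, callees in graph.items():
    print(caller, dict(callees))
# The nodes are bare phone numbers: no names, no subscribers, no identity.
```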

Identifying Individuals

What about matching the phone numbers to the major credit bureaus?

Where the FTC found:

A Federal Trade Commission study of the U.S. credit reporting industry found that five percent of consumers had errors on one of their three major credit reports that could lead to them paying more for products such as auto loans and insurance.

Overall, the congressionally mandated study on credit report accuracy found that one in five consumers had an error on at least one of their three credit reports.

In FTC Study, Five Percent of Consumers Had Errors on Their Credit Reports That Could Result in Less Favorable Terms for Loans

Bear in mind that the sources in credit reports have their own methods for identifying individuals, which are not exposed through the credit bureaus.

As I recall, credit reports don’t include political or social activity.

Identifying Social Networks

Assuming some rough set of names, it might be possible to match those names against FaceBook and other social media sites, and then to map the relationships there back to the relationships in the original Verizon data.

The main problem being that every data set uses a different means to identify the same individuals and associations between individuals.

You and I may be friends on FaceBook, doing business together on LinkedIn and have cell phone conversations in the Verizon data, but the question will be mapping all of those together.

And remember that all those systems are dynamic. Knowing my network of contacts six (6) weeks ago may or may not be useful now.

To be useful, the NSA will need to query along different identifications in different systems for the same person and have the results returned across all the systems.
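What that kind of cross-system query could look like, sketched in Python with invented identifiers (building and maintaining the identity map is the hard, human part that is missing):

```python
# Invented identifiers for one person across three systems.
identity_map = {"person-42": {"verizon":  "555-0100",
                              "facebook": "kathryn.brown.1234",
                              "linkedin": "in/kbrown-hypothetical"}}

verizon_calls    = {"555-0100": ["555-0199", "555-0142"]}
facebook_friends = {"kathryn.brown.1234": ["d.hill.5678"]}
linkedin_links   = {"in/kbrown-hypothetical": ["in/rmilch-hypothetical"]}

def query_across_systems(person):
    # One query, results returned across all the systems.
    ids = identity_map[person]
    return {"calls":   verizon_calls.get(ids["verizon"], []),
            "friends": facebook_friends.get(ids["facebook"], []),
            "links":   linkedin_links.get(ids["linkedin"], [])}

print(query_across_systems("person-42"))
```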

Otherwise, Verizon will show a healthy profit after July 19, 2013 in fees for delivering electronic copies of data it already collects.

A “secret” court order was necessary to make the sale.

The NSA will have a data set just because it exists. “Big data” supports funding goals.

Verizon users? No less privacy than they had when only Verizon had the data.

Vocabulary Management at W3C (Draft) [ontology and vocabulary as synonyms]

Filed under: Ontology,Semantics,Vocabularies — Patrick Durusau @ 8:51 am

Vocabulary Management at W3C (Draft)

From the webpage:

One of the major stumbling blocks in deploying RDF has been the difficulty data providers have in determining which vocabularies to use. For example, a publisher of scientific papers who wants to embed document metadata in the web pages about each paper has to make an extensive search to find the possible vocabularies and gather the data to decide which among them are appropriate for this use. Many vocabularies may already exist, but they are difficult to find; there may be more than one on the same subject area, but it is not clear which ones have a reasonable level of stability and community acceptance; or there may be none, i.e. one may have to be developed in which case it is unclear how to make the community know about the existence of such a vocabulary.

There have been several attempts to create vocabulary catalogs, indexes, etc. but none of them has gained a general acceptance and few have remained up for very long. The latest notable attempt is LOV, created and maintained by Bernard Vatant (Mondeca) and Pierre-Yves Vandenbussche (DERI) as part of the DataLift project. Other application areas have more specific, application-dependent catalogs; e.g., the HCLS community has established such application-specific “ontology portals” (vocabulary hosting and/or directory services) as NCBO and OBO. (Note that for the purposes of this document, the terms “ontology” and “vocabulary” are synonyms.) Unfortunately, many of the cataloging projects in the past relied on a specific project or some individuals and they became, more often than not, obsolete after a while.

Initially (1999-2003) W3C stayed out of this process, waiting to see if the community would sort out this issue by itself. We hoped to see the emergence of an open market for vocabularies, including development tools, reviews, catalogs, consultants, etc. When that did not emerge, we decided to begin offering ontology hosting (on www.w3.org) and we began the Ontaria project (with DARPA funding) to provide an ontology directory service. Implementation of these services was not completed, however, and project funding ended in 2005. After that, W3C took no active role until the emergence of schema.org and the eventual creation of the Web Schemas Task Force of the Semantic Web Interest Group. WSTF was created both to provide an open process for schema.org and as a general forum for people interested in developing vocabularies. At this point, we are contemplating taking a more active role supporting the vocabulary ecosystem. (emphasis added)

The W3C proposal fails to address two issues with vocabularies:

1. Vocabularies are not the origin of the meanings of terms they contain.

Awful, according to yet another master of the king’s English quoted by Fries, could only mean awe-inspiring.

But it was not so. “The real meaning of any word,” argued Fries, “must be finally determined, not by its original meaning, its source or etymology, but by the content given the word in actual practical usage…. Even a hardy purist would scarcely dare pronounce a painter’s masterpiece awful, without explanations.” (The Story of Ain’t by David Skinner, HarperCollins 2012, page 47)

Vocabularies represent some community of semantic practice but that brings us to the second problem the W3C proposal ignores.

2. The meanings of terms in a vocabulary are not stable, universal, nor self-evident.

The problem with most vocabularies is that they have no way to signal the context, community or other information that would help distinguish one vocabulary meaning from another.

A human reader may intuit context and other clues from a vocabulary and use those factors when comparing the vocabulary to a text.

Computers, on the other hand, know no more than they have been told.

Vocabularies need to move beyond being simple tokens and represent terms with structures that capture some of the information a human reader knows intuitively about those terms.

Otherwise vocabularies will remain mute records of some socially defined meaning, but we won’t know which ones.
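A sketch of what such structures could look like, reusing Fries’ “awful” example with made-up entries:

```python
# Made-up vocabulary entries: the bare term is ambiguous, so each entry
# carries the context a human reader would otherwise have to intuit.
vocabulary = [
    {"term": "awful",
     "community": "18th-century purists",
     "context": "literary criticism",
     "meaning": "awe-inspiring"},
    {"term": "awful",
     "community": "contemporary everyday usage",
     "context": "general conversation",
     "meaning": "very bad"},
]

def senses(term, community=None):
    return [e for e in vocabulary
            if e["term"] == term
            and (community is None or e["community"] == community)]

for entry in senses("awful"):
    print(entry["community"], "->", entry["meaning"])
```

Nothing fancy: the disambiguating information travels with the term instead of living only in a human reader’s head.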

Speeding Through Haskell

Filed under: Functional Programming,Haskell — Patrick Durusau @ 7:54 am

Speeding Through Haskell: With Example Code by Mihai-Radu Popescu.

A work in progress (at 87 pages) with a breezy writing style and lots of examples.

Only available in PDF format.

Unfortunate because adding your own notes as you work through the examples would make it more valuable to you. There would be the issue of migrating your notes to a later version, a problem that remains after 20+ years of markup.

The download page is in Spanish but the text is in English.

Either download link returns the same content, one as a .zip file containing the PDF file and the other as a PDF file.

How would you solve the note migration issue?

I first saw this in a tweet by CompSciFact.

Balisage 2013 – Late-Breaking – Deadline June 14, 2013

Filed under: Conferences,Topic Maps — Patrick Durusau @ 7:32 am

You saw the papers that made the cut for Balisage 2013.

You know you can do better!

See the rules for Late-breaking News.

Special Offer: If you have a late-breaking proposal for Balisage 2013 on topic maps, I volunteer to copy-edit (not write) your proposal for free.

Up to you whether to accept or reject my suggested edits.

FYI: If accepted (competition is fierce), you need to do the presentation yourself. I help edit but I no longer travel.

June 5, 2013

Interactive Entity Resolution in Relational Data… [NG Topic Map Authoring]

Filed under: Authoring Semantics,Authoring Topic Maps,Deduplication,Integration — Patrick Durusau @ 4:47 pm

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation by Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, Louis Licamele.

Abstract:

Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to underlying real-world entities, is an important data cleaning process required before accurate visualization or analysis of the data is possible. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during the entity resolution process; it is useful both for the algorithms which determine likely database references to be resolved and for visual analytic tools which support the entity resolution process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe effectively combines relational entity resolution algorithms with a novel network visualization that enables users to make use of an entity’s relational context for making resolution decisions. Since resolution decisions often are interdependent, D-Dupe facilitates understanding this complex process through animations which highlight combined inferences and a history mechanism which allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed the benefits of the relational context visualization on the performance of entity resolution tasks in relational data in terms of time as well as users’ confidence and satisfaction.

Talk about a topic map authoring tool!

It even chains entity resolution decisions together!

Not to be greedy, but interactive data deduplication and integration in Hadoop would be a nice touch. 😉

Software: D-Dupe: A Novel Tool for Interactive Data Deduplication and Integration.

Entity recognition with Scala and…

Filed under: Entity Resolution,Natural Language Processing,Scala,Stanford NLP — Patrick Durusau @ 4:05 pm

Entity recognition with Scala and Stanford NLP Named Entity Recognizer by Gary Sieling.

From the post:

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

(…)

The voting on entity recognition made me curious about interactive entity resolution where a user has a voice.
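Before moving on, a minimal sketch of the voting idea (tokens and labels are invented, in Python rather than Scala, and no Stanford NLP is involved):

```python
from collections import Counter

# Per-token labels from three hypothetical classifiers over the same text.
tokens = ["Acme", "Corp", "sued", "Jane", "Doe"]
classifier_outputs = [
    ["ORGANIZATION", "ORGANIZATION", "O", "PERSON", "PERSON"],
    ["ORGANIZATION", "LOCATION",     "O", "PERSON", "PERSON"],
    ["PERSON",       "ORGANIZATION", "O", "PERSON", "ORGANIZATION"],
]

def vote(outputs):
    # Majority label per token; ties fall to the first most common label.
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*outputs)]

for token, label in zip(tokens, vote(classifier_outputs)):
    print(token, label)
```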

See the next post.

Usability & User Experience Community

Filed under: Interface Research/Design,Usability,UX — Patrick Durusau @ 3:51 pm

Usability & User Experience Community

From the webpage:

This web site is a forum to share information and experiences on issues related to the usability and user-centered design. It is the home of the Usability and User Experience Community of the Society for Technical Communication.

Home of the Heuristic Evaluation – A System Checklist resource.

An abundance of usability resources, particularly under “New to Usability?”

Every hour you spend at this site may save users days of unproductive annoyance with your products.

Heuristic Evaluation – A System Checklist

Filed under: Interface Research/Design,Usability,UX — Patrick Durusau @ 3:13 pm

Heuristic Evaluation – A System Checklist by Deniese Pierotti.

An interface review checklist, with each topic followed by its number of questions:

  1. Visibility of System Status (29)
  2. Match Between System and the Real World (24)
  3. User Control and Freedom (23)
  4. Consistency and Standards (51)
  5. Help Users Recognize, Diagnose, and Recover from Errors (21)
  6. Error Prevention (15)
  7. Recognition Rather Than Recall (40)
  8. Flexibility and Minimalist Design (16)
  9. Aesthetic and Minimalist Design (12)
  10. Help and Documentation (23)
  11. Skills (22)
  12. Pleasurable and Respectful Interaction with the User (17)
  13. Privacy (3)

Almost three hundred (300) questions to make you think about your application and its interface.

A good basis for a web form populated with a history of prior ratings and comments, along with space for entry of new ratings and comments.

Being able to upload screen shots would be a nice touch as well.

I may be doing some UI evaluation soon so I will have to keep this in mind.

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

Filed under: Cloudera,Hadoop,Lucene,Solr — Patrick Durusau @ 2:41 pm

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.

From the post:

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:

(…)

Did you catch the line:

Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

Does that make you feel better about scale issues?
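Since Cloudera Search speaks standard Solr, a query is just an HTTP request. A rough Python sketch (the host, collection name, and field are assumptions to adjust for a real deployment):

```python
import json
import urllib.parse
import urllib.request

# Assumed host, collection name, and field for a Solr 4.x / Cloudera Search
# deployment; adjust all three for a real cluster.
params = urllib.parse.urlencode({"q": "text:hadoop", "rows": 10, "wt": "json"})
url = "http://localhost:8983/solr/collection1/select?" + params

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

print(results["response"]["numFound"], "documents matched")
for doc in results["response"]["docs"]:
    print(doc.get("id"))
```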

Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.

A serious step up in capabilities.

International Tracing Service Archive

Filed under: Data Mining,Dataset,Topic Maps — Patrick Durusau @ 10:45 am

International Tracing Service Archive (U.S. Holocaust Memorial Museum)

The posting on Crowdsourcing + Machine Learning… reminded me to check on access to the archives of the International Tracing Service.

Let’s just say the International Tracing Service has a poor track record on accessibility to its archives. An archive of documents the ITS describes as:

Placed end-to-end, the documents in the ITS archives would extend to a length of about 26,000 metres.

Fortunately digitized copies of portions of the archives are available at other locations, such as the U.S. Holocaust Memorial Museum.

The FAQ on the archives answers the question “Are the records goings to be on the Internet?” this way:

Regrettably, the collection was neither organized nor digitized to be directly searchable online. Therefore, the Museum’s top priority is to develop software and a database that will efficiently search the records so we can quickly respond to survivor requests for information.

Only a small fraction of the records are machine readable. In order to be searched by Google or Yahoo! search engines, all of the data must be machine readable.

Searching the material is an arduous task in any event. The ITS records are in some 25 different languages and contain millions of names, many with multiple spellings. Many of the records are entirely handwritten. In cases where forms were used, the forms are written in German and the entries are often handwritten in another language.

The best way to ensure that survivors receive accurate information quickly and easily will be by submitting requests to the Museum by e-mail, regular mail, or fax, and trained Museum staff will assist with the research. The Museum will provide copies of all relevant original documents to survivors who wish to receive them via e-mail or regular mail.

The priority of the Museum is in answering requests for information from survivors.

However, we do know that multiple languages and handwritten texts are not barriers to creating machine readable texts for online searching.

The searches would not be perfect but even double-key entry of all the data would not be perfect.

What better way to introduce digitally literate generations to the actuality of the Holocaust than to involve them in crowd-sourcing the proofing of a machine transcription of this archive?

Then the Holocaust would not be a few weeks in history class or a museum or memorial to visit, but an experience with documents recording the fates of millions.

PS: Creating trails through the multiple languages, spellings, locations, etc., by researchers, trails that can be enhanced by other researchers, would highlight the advantages of topic maps in historical research.

Texas Conference on Digital Libraries 2013

Filed under: Digital Library,Librarian/Expert Searchers,Library — Patrick Durusau @ 9:33 am

Texas Conference on Digital Libraries 2013

Abstracts and in many cases presentations from the Texas Conference on Digital Libraries 2013.

A real treasure trove on digital libraries projects and issues.

Library: A place where IR isn’t limited by software.

Crowdsourcing + Machine Learning…

Filed under: Crowd Sourcing,Machine Learning,Manuscripts — Patrick Durusau @ 9:20 am

Crowdsourcing + Machine Learning: Nicholas Woodward at TCDL by Ben W. Brumfield.

I was so impressed by Nicholas Woodward’s presentation at TCDL this year that I asked him if I could share “Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text” on this blog.

Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach for transcribing large scanned document collections that combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. This purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion on potential avenues for future work.

Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882-2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters) such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.

While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (See below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.

A post at the intersection of many of my interests!

Imagine pushing this just a tad further to incorporate management of subject identity, whether visible to the user or not.

Trends in Machine Learning [SciPy]

Filed under: Machine Learning,Python — Patrick Durusau @ 8:11 am

Trends in Machine Learning by Olivier Grisel.

Slides from presentation at Paris DataGeeks 2013.

Focus is on Python and SciPy.

Covers probabilistic programming, deep learning, and has links at the end.

Good way to check your currency on machine learning with Python.

crowdcrafting

Filed under: Crowd Sourcing,Interface Research/Design,Usability — Patrick Durusau @ 8:04 am

crowdcrafting

Crowdcrafting is an instance of PyBossa:

From the about page:

PyBossa is a free, open-source crowd-sourcing and micro-tasking platform. It enables people to create and run projects that utilise online assistance in performing tasks that require human cognition such as image classification, transcription, geocoding and more. PyBossa is there to help researchers, civic hackers and developers to create projects where anyone around the world with some time, interest and an internet connection can contribute.

PyBossa is different to existing efforts:

  • It’s a 100% open-source
  • Unlike, say, “mechanical turk” style projects, PyBossa is not designed to handle payment or money — it is designed to support volunteer-driven projects.
  • It’s designed as a platform and framework for developing and deploying crowd-sourcing and microtasking apps rather than being a crowd-sourcing application itself. Individual crowd-sourcing apps are written as simple snippets of Javascript and HTML which are then deployed on a PyBossa instance (such as CrowdCrafting.org). This way one can easily develop custom apps while using the PyBossa platform to store your data, manage users, and handle workflow.

You can read more about the architecture in the PyBossa Documentation and follow the step-by-step tutorial to create your own apps.

Are interfaces for volunteer projects better than those for for-hire projects?

Do they need to be?

How would you overcome the gap between “…this is how I see the interface…” (the developers) and the interface that users prefer?

Hint: 20th century advertising discovered that secret decades ago. See: Predicting What People Want and especially the reference to Selling Blue Elephants.

June 4, 2013

Full Healthcare Interoperability “…may take some creative thinking.”

Filed under: Health care,Interoperability — Patrick Durusau @ 3:41 pm

Completing drive toward healthcare interoperability will be challenge by Ed Burns.

From the post:

The industry has made progress toward healthcare interoperability in the last couple years, but getting over the final hump may take some creative thinking. There are still no easy answers for how to build fully interoperable nationwide networks.

At the Massachusetts Institute of Technology CIO Symposium, held May 22 in Cambridge, Ma., Beth Israel Deaconess Medical Center CIO John Halamka, M.D., said significant progress has been made.

In particular, he pointed to the growing role of the Clinical Document Architecture (CDA) standard. Under the 2014 Certification Standards, EHR software must be able to produce transition of care documents in this form.

But not every vendor has reached the point where it fully supports this standard, and it is not the universal default for clinician data entry. Additionally, Halamka pointed out that information in health records tends to be incomplete. Often the worker responsible for entering important demographic data and other information into the record is the least-trained person on the staff, which can increase the risk of errors and produce bad data.

There are ways around the lack of vendor support for healthcare data interoperability. Halamka said most states’ information exchanges can function as middleware. As an example, he talked about how Beth Israel is able to exchange information with Atrius Health, a group of community-based hospitals in Eastern Massachusetts, across the state’s HIE even though the two networks are on different systems.

“You can get around what the vendor is able to do with middleware,” Halamka said.

But while these incremental changes have improved data interoperability, supporting full interconnectedness across all vendor systems and provider networks could take some new solutions.

Actually “full” healthcare interoperability isn’t even a possibility.

What we can do is decide how much interoperability is worth in particular situations and do the amount required.

Everyone in the healthcare industry has one or more reasons for the formats and semantics they use now.

Changing those formats and semantics requires not only changing the software but training the people who use the software and the data it produces.

Not to mention the small task of deciding on what basis interoperability will be built.

As you would expect, I think a topic map as middleware solution, one that ties diverse systems together in a re-usable way, is the best option.

Convincing the IT system innocents that write healthcare policy that demanding interoperability isn’t an effective strategy would be a first step.

What would you suggest as a second step?

Call for Participation in Argument Representation Community Group

Filed under: Argumentation,W3C — Patrick Durusau @ 3:26 pm

Call for Participation in Argument Representation Community Group

From the post:

Argument-Representation’s mission is to recommend a standardized representation for formal argument. It is not intended to augment XML in any other way.

The group does not necessarily commit to creating a novel representation. For instance, after due consideration it could endorse an existing one or recommend accepting an existing one with minor changes.

Formal argument means a formalizable set of connected statements or statement-like objects intended to establish a proposition.

Do you think we have been here before?

Common Logic for example?

Knowledge Interchange Format as another?

Others? 😉

We have met the source of semantic diversity and it is us.

WeevilScout [Distributed Browser Computing]

Filed under: Distributed Computing,Javascript,Web Browser — Patrick Durusau @ 2:35 pm

WeevilScout

From this poster:

The proliferation of web browsers and the performance gain being achieved by current JavaScript virtual machines raises the question whether Internet browsers can become yet another middleware for distributed computing.

Will we need new HPC benchmarks when 10 million high-end PCs link their web browser JavaScript engines together?

What about 20 million high-end PCs?

But the ability to ask questions of large data sets is no guarantee that we will formulate good questions to ask.

Pointers to discussions on how to decide what questions to ask?

Or do we ask the old questions and just get the results more quickly?

I first saw this at Nat Torkington’s Four short links: 4 June 2013.

libsregex

Filed under: Regex,Searching — Patrick Durusau @ 2:22 pm

libsregex by Yichun Zhang.

From the homepage:

libsregex – A non-backtracking regex engine library for large data streams

And see:

Streaming regex matching and substitution by the sregex library by Yichun Zhang.

This looks quite good!
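sregex itself is a C library, so the Python below is only an illustration of the problem it solves: a naive per-chunk scan misses matches that span chunk boundaries, which is exactly what a streaming engine handles without buffering the whole input.

```python
import re

# Toy data and pattern; sregex is not used here.
pattern = re.compile(r"ERROR \d{4}")
data = "....ERROR 1234...." * 3                # pretend this is a large stream
chunks = [data[i:i + 7] for i in range(0, len(data), 7)]   # 7-byte reads

per_chunk = [m for chunk in chunks for m in pattern.findall(chunk)]
whole = pattern.findall(data)

print("matches scanning chunk by chunk:", len(per_chunk))  # 0: boundary matches lost
print("matches scanning the whole input:", len(whole))     # 3
```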

I first saw this at Nat Torkington’s Four short links: 4 June 2013.
