Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 16, 2013

From Records to a Web of Library Data – Pt1 Entification

Filed under: Entities,Library,Linked Data — Patrick Durusau @ 3:10 pm

From Records to a Web of Library Data – Pt1 Entification by Richard Wallis.

From the post:

Entification

Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…

What is it, and why should I care, you may be asking.

I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.

As you know, I am far from convinced about the “benefits” of Linked Data, at least with its current definition.

Who knows what definition “Linked Data” may have in some future vision of the W3C? (URL Homonym Problem: A Topic Map Solution, a tale of how the W3C decided to redefine URL.)

But Richard’s point about the ugliness and utility of “entification” is well taken.

So long as you remember that every term can be described “in terms of other things.”

There are no primitive terms, not one.

Cypher basics: it all starts with the START

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 2:27 pm

Cypher basics: it all starts with the START by Wes Freeman.

From the post:

“It all starts with the START” -Michael Hunger, Cypher webinar, Sep 2012

The start clause is one of those things that seems quite simple initially. You specify your start point(s) for the rest of the query. Typically, you use an index lookup, or if you’re just messing around, a node id (or list of node ids). This sets the stage for you to match a traversal pattern, or just filter your nodes with a where. Let’s start with a simple example–here we’re going to find a single node, and return it (later we’ll get into why start is sort of like a SQL from):

Wes continues his excellent introduction to Cypher.
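
If you want to follow along from Python, here is a minimal sketch that posts start-anchored Cypher queries to Neo4j’s /db/data/cypher REST endpoint. The local URL and the “users” index are assumptions for illustration, not details from Wes’s post.

```python
import requests

# Assumed local Neo4j instance exposing the /db/data/cypher REST endpoint.
CYPHER_URL = "http://localhost:7474/db/data/cypher"

# START anchors the query by node id -- the "just messing around" case.
query_by_id = "START n = node({node_id}) RETURN n"

# More typically, START uses an index lookup (the "users" index is hypothetical).
query_by_index = 'START n = node:users(name = "Alice") RETURN n'

def run_cypher(query, params=None):
    """POST a Cypher query and return the decoded JSON response."""
    resp = requests.post(CYPHER_URL, json={"query": query, "params": params or {}})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(run_cypher(query_by_id, {"node_id": 1}))
    print(run_cypher(query_by_index))
```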

Finding Shakespeare’s Favourite Words With Data Explorer

Filed under: Data Explorer,Data Mining,Excel,Microsoft,Text Mining — Patrick Durusau @ 2:07 pm

Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.

From the post:

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.

If Data Explorer is a GUI on top of M (the link below is outdated but serves as a point of origin), it goes up in importance.

From the M link:

The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.

A key advantage of modeling in text is ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language feature of M enables information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.

I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise. 😉
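
For comparison outside of Data Explorer and M, here is a rough sketch in Python of the same exercise. It assumes the complete works are saved locally as shakespeare.txt; the filename is mine, not Chris’s.

```python
import re
from collections import Counter

# Path is an assumption; point it at your local copy of the complete works.
with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Crude tokenization: runs of letters and apostrophes count as words.
words = re.findall(r"[a-z']+", text)

# The top 100 most frequently used words.
top_100 = Counter(words).most_common(100)

for word, count in top_100:
    print(f"{word}\t{count}")
```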

MetaNetX.org…

Filed under: Bioinformatics,Biomedical,Genomics,Modeling,Semantic Diversity — Patrick Durusau @ 1:42 pm

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
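
To make the common namespace idea concrete, here is a toy sketch in Python of what reconciling metabolite identifiers into a shared namespace looks like. The identifiers and the mapping table below are invented for illustration; they are not MNXref data.

```python
# Invented reconciliation table: (source namespace, id) -> common-namespace id.
MNX_MAP = {
    ("bigg", "glc__D"): "MNXM41",
    ("kegg", "C00031"): "MNXM41",   # same metabolite, different source namespace
    ("bigg", "pyr"):    "MNXM23",
    ("kegg", "C00022"): "MNXM23",
}

def to_common(namespace, metabolites):
    """Translate a model's metabolite ids into the common namespace."""
    return {MNX_MAP.get((namespace, m), f"UNMAPPED:{m}") for m in metabolites}

model_a = to_common("bigg", ["glc__D", "pyr"])
model_b = to_common("kegg", ["C00031", "C00022"])

# Once both models live in one namespace, comparison is a set operation.
print("shared metabolites:", model_a & model_b)
print("only in A:", model_a - model_b)
```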

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

The Next 700 Programming Languages
[Essence of Topic Maps]

Filed under: Language Design,Programming,Subject Identity,Topic Maps — Patrick Durusau @ 1:14 pm

The Next 700 Programming Languages by P. J. Landin.

ABSTRACT:

A family of unimplemented computing languages is described that is intended to span differences of application area by a unified framework. This framework dictates the rules about the uses of user-coined names, and the conventions about characterizing functional relationships. Within this framework the design of a specific language splits into two independent parts. One is the choice of written appearances of programs (or more generally, their physical representation). The other is the choice of the abstract entities (such as numbers, character-strings, lists of them, functional relations among them) that can be referred to in the language.

The system is biased towards “expressions” rather than “statements.” It includes a nonprocedural (purely functional) subsystem that aims to expand the class of users’ needs that can be met by a single print-instruction, without sacrificing the important properties that make conventional right-hand-side expressions easy to construct and understand.

The introduction to this paper reminded me of an acronym, SWIM (See What I Mean), that was coined, to my knowledge, by Michel Biezunski several years ago:

Most programming languages are partly a way of expressing things in terms of other things and partly a basic set of given things. The ISWIM (If you See What I Mean) system is a byproduct of an attempt to disentangle these two aspects in some current languages.

This attempt has led the author to think that many linguistic idiosyncracies are concerned with the former rather than the latter, whereas aptitude for a particular class of tasks is essentially determined by the latter rather than the former. The conclusion follows that many language characteristics are irrelevant to the alleged problem orientation.

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.” So it is not a language so much as a family of languages, of which each member is the result of choosing a set of primitives. The possibilities concerning this set and what is needed to specify such a set are discussed below.

The essence of topic maps is captured by:

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.”

Every information system has a set of terms, the meanings of which are known to its designers and/or users.

Data integration issues arise from the description of terms, “in terms of other things,” being known only to designers and users.

The power of topic maps comes from the expression of descriptions “in terms of other things,” for terms.

Other designers or users can examine those descriptions to see if they recognize any terms similar to those they know by other descriptions.

If they discover descriptions they consider to be of the same thing, they can then create a mapping between those terms.

Hopefully using the descriptions as a basis for the mapping. A mapping of term to term only multiplies the opaqueness of the terms.
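
Here is a toy sketch in Python of that workflow: each system describes its terms “in terms of other things” (property/value pairs below), and a term-to-term mapping is proposed only where the descriptions coincide. The vocabularies are invented for illustration.

```python
# Each term is described "in terms of other things," here as property/value pairs.
system_a = {
    "SSN": {("identifies", "person"), ("issued-by", "Social Security Administration")},
    "DOB": {("means", "date of birth"), ("format", "YYYY-MM-DD")},
}
system_b = {
    "TaxID":     {("identifies", "person"), ("issued-by", "Social Security Administration")},
    "BirthDate": {("means", "date of birth"), ("format", "YYYY-MM-DD")},
}

def propose_mappings(a, b):
    """Map term to term only where their descriptions coincide."""
    return [(ta, tb) for ta, da in a.items()
                     for tb, db in b.items()
                     if da == db]

for term_a, term_b in propose_mappings(system_a, system_b):
    print(f"{term_a} <-> {term_b}")
```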

For some systems, Social Security Administration databases for example, descriptions of terms “in terms of other things” may not be part of the database itself, but could be maintained as a “best practice” to facilitate later maintenance and changes.

For other systems, the U.S. intelligence community for example, still chasing the will-o’-the-wisp* of standard terminology for non-standard terms, even the possibility of interchange depends on describing terms “in terms of other things.”

Before you ask, yes, yes the Topic Maps Data Model (TMDM) and the various Topic Maps syntaxes are terms that can be described “in terms of other things.”

The advantage of the TMDM and relevant syntaxes is that even if not described “in terms of other things,” standardized terms enable interchange of a class of mappings. The default identification mapping in the TMDM is by IRIs.

Before and since Landin’s article we have been producing terms that could be described “in terms of other things.” In CS and other areas of human endeavor as well.

Isn’t it about time we started describing our terms rather than clamoring for one set of undescribed terms or another?


* I use the term will-o’-the-wisp quite deliberately.

After decades of failure to create universal information systems with computers, following on centuries of non-computer failures to reach the same goal, following on millennia of semantic and linguistic diversity, someone knows that attempts at universal information systems will leave intelligence agencies not sharing critical data.

Perhaps the method you choose says a great deal about the true goals of your project.

I first saw this in a tweet by CompSciFact.

March 15, 2013

CIA Prophet Pointed to Big Data Future

Filed under: BigData,Semantics — Patrick Durusau @ 7:10 pm

CIA Prophet Pointed to Big Data Future by Issac Lopez.

Issac writes:

“What does the size of the next coffee crop, bullfight attendance figures, local newspaper coverage of UN matters, the birth rate, the mean daily temperatures or refrigerator sales across the country have to do with who will next be elected president of Guatemala,” asks Orrin Clotworthy in the report, which he styled “a Jules Verne look at intelligence processes in a coming generation.”

“Perhaps nothing” he answers, but notes that there is a cause behind each vote cast in an election and many quantitative factors may exist to help shape that decision. “To learn just what the factors are, how to measure them, how to weight them, and how to keep them flowing into a computing center for continual analysis will some day be a matter of great concern to all of us in the intelligence community,” prophesied Clotworthy, describing the challenges that organizations around the globe face fifty years after the report was authored.

I’m not sure if Issac means big data is closer to measuring the factors that motivate people or if big data will seize upon what can be measured as motivators.

The issue of standardized tests is a current one in the United States and it is far from settled whether the tests measure anything about the educational process, the ability to take standardized tests, or some other aspect of students.

You can read the report in full here.

Issac quotes another part of the report but only in part:

IBM has developed for public use a computer-based system called the ‘Selective Disseminator of Information.’ Intended for large organizations dealing with heterogeneous masses of information, it scans all incoming material and delivers those items that are of interest to specific offices in accordance with “profiles” of their needs which are continuously updated by a feed-back device.

But Clotworthy continues in the next sentence to say:

Any comment here on the potential of the SDI for an intelligence agency would be superfluous; Air Intelligence has in fact been experimenting with such a mechanized dissemination system for some years.

Fifty (50) years later and the device that needs no description continues to elude us.

Is there a semantic equivalent to NP-complete?

Document Mining with Overview:…

Filed under: Document Classification,Document Management,News,Reporting,Text Mining — Patrick Durusau @ 5:24 pm

Document Mining with Overview:… A Digital Tools Tutorial by Jonathan Stray.

The slides from the Overview presentation I mentioned yesterday.

One of the few webinars I have ever attended where nodding off was not a problem! Interesting stuff.

It is designed for the use case where there “…is too much material to read on deadline.”

A cross between document mining and document management.

A cross that hides a lot of the complexity from the user.

Definitely a project to watch.

YSmart: Yet Another SQL-to-MapReduce Translator

Filed under: MapReduce,SQL — Patrick Durusau @ 4:30 pm

YSmart: Yet Another SQL-to-MapReduce Translator by Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang.

Abstract:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.

Just in case you aren’t picking the videos for this weekend.

Alex Popescu points this paper out at: Paper: YSmart – Yet Another SQL-to-MapReduce Translator.
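
The correlation idea is easy to illustrate outside of MapReduce. Two aggregations over the same table can share a single scan of the data, instead of one pass (or one job) per operation. A toy sketch in Python, not YSmart’s actual rule set; the orders.csv file and its columns are made up:

```python
import csv
from collections import defaultdict

# A naive one-operation-to-one-job translator scans the data twice:
# once for revenue per customer, once for order counts per customer.
# A correlation-aware plan computes both in a single pass.

def shared_scan(path):
    revenue = defaultdict(float)
    orders = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):        # one scan serves both "queries"
            cust = row["customer_id"]
            revenue[cust] += float(row["amount"])
            orders[cust] += 1
    return revenue, orders

if __name__ == "__main__":
    rev, cnt = shared_scan("orders.csv")
    print(dict(rev), dict(cnt))
```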

Using Solr’s New Atomic Updates

Filed under: Indexing,Solr — Patrick Durusau @ 4:08 pm

Using Solr’s New Atomic Updates by Scott Stults.

From the post:

A while ago we created a sample index of US patent grants roughly 700k documents big. Adjacently we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.

You see, we wanted to give our UI the ability to flip through those thumbnails and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then tried to pull down subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn’t scale because a good portion of our requests were for non-existent resources.

Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?

A new feature in Solr and one that I suspect will be handy, for example when updating an index of associations.
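
Here is a minimal sketch of what such an atomic update looks like from Python. The core name (“patents”), document id and pageCount field are placeholders of my own, not Scott’s schema.

```python
import requests

# Assumed local Solr 4.x instance; the core name "patents" is a placeholder.
SOLR_UPDATE = "http://localhost:8983/solr/patents/update?commit=true"

def set_page_count(doc_id, pages):
    """Atomically set one field without resending the whole document."""
    # Atomic updates need stored fields and <updateLog/> enabled in solrconfig.xml.
    payload = [{"id": doc_id, "pageCount": {"set": pages}}]
    resp = requests.post(SOLR_UPDATE, json=payload)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(set_page_count("US1234567", 14))
```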

A Peek Behind the Neo4j Lucene Index Curtain

Filed under: Indexing,Lucene,Neo4j — Patrick Durusau @ 4:02 pm

A Peek Behind the Neo4j Lucene Index Curtain by Max De Marzi.

Max suggests using a copy of your Neo4j database for this exercise.

Could be worth your while to go exploring.

And you will learn something about Lucene in the bargain.

Cypher for SQL Professionals

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 3:48 pm

Cypher for SQL Professionals by Michael Hunger. (video)

From the webpage:

Cypher is a graph query language used in Neo4j. Much like SQL, it’s a declarative language used for querying databases.

What does a join look like in Cypher? What about a left outer join? There are a lot of similarities between the two, Cypher is heavily influenced by SQL. We’ll talk about what these common concepts and differences are, what Cypher gives you that SQL leaves you wanting. Come along and see how Neo4j and Cypher can make your daily grind much easier and more fun.

You’ll learn how to use your current SQL knowledge to quickly get started using Neo4j and Cypher.
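
As a taste of the correspondence Michael covers, here is a small sketch pairing a SQL join with a roughly equivalent Cypher pattern. The schema (people, companies, WORKS_AT) is invented, and the Cypher is written in post-2.0 syntax rather than necessarily what appears in the talk.

```python
# A SQL inner join...
sql = """
SELECT p.name, c.name
FROM people p
JOIN works_at w ON w.person_id = p.id
JOIN companies c ON c.id = w.company_id;
"""

# ...expressed in Cypher as a graph pattern: the join table becomes a relationship.
cypher = """
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
RETURN p.name, c.name;
"""

print(sql)
print(cypher)
```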

Just in case you get to pick the video for this weekend. 😉

This is work friendly so you might better save it for a lunch break next week.

February NYC DataKind Meetup (video)

Filed under: DataKind,Public Data — Patrick Durusau @ 1:35 pm

February NYC DataKind Meetup (video)

From the post:

A video of our February NYC DataKind Meetup is online for those of you who couldn’t join us in New York. Hear about the projects our amazing Data Ambassadors are working on with Medic Mobile, Sunlight Foundation, and Refugees United as well as listen to Anoush Tatevossian from the UN Global Pulse talk about how the UN is using data for the greater good. It was a fantastic event and we’re thrilled to get to share it with all of you.

A great pre-meeting format, beer first and during the presentations.

Need to recommend that format to Balisage.

None for the speaker, they could be the “designated driver” before and during their presentation.

HBaseCon 2013

Filed under: Conferences,Hadoop — Patrick Durusau @ 12:45 pm

HBaseCon 2013


Abstracts are due by midnight on April 1, 2013.

Conference: Thursday, June 13, 2013
San Francisco Marriott Marquis

From the webpage:

Early Bird registration is now open (until April 23), and we’re asking all members of the community to submit abstracts for sessions pertaining to:

  • HBase internals and futures
  • Best practices for running HBase in production
  • HBase use cases and applications
  • How to contribute to HBase

Abstracts are due by midnight on April 1, 2013. You will be notified by the Program Committee about your proposal’s status by April 15, 2013.

Waiting for all the components in the Hadoop ecosystem to have separate but co-located conferences. That would be radically cool!

Netflix Cloud Prize [$10K plus other stuff]

Filed under: Contest,Marketing,Topic Maps — Patrick Durusau @ 12:29 pm

Netflix Cloud Prize

Duration of Contest: 13th March 2013 to 15th September 2013.

From github:

This contest is for software developers.

Step 0 – You need your own GitHub account

Step 1 – Read the rules in the Wiki

Step 2 – Fork this repo to your own GitHub account

Step 3 – Send us your email address

Step 4 – Modify your copy of the repo as your Submission

Categories/Prizes:

We want you to build something cool using or modifying our open source software. Your submission will be a standalone program or a patch for one of our open source projects. Your submission will be judged in these categories:

  1. Best Example Application Mash-Up

  2. Best New Monkey

  3. Best Contribution to Code Quality

  4. Best New Feature

  5. Best Contribution to Operational Tools, Availability, and Manageability

  6. Best Portability Enhancement

  7. Best Contribution to Performance Improvements

  8. Best Datastore Integration

  9. Best Usability Enhancement

  10. Judges Choice Award

If you win, you’ll get US$10,000 cash, US$5000 AWS credits, a trip to Las Vegas for two, a ticket to Amazon’s user conference, and fame and notoriety (at least within Netflix Engineering).

I can see several of those categories where topic maps would make a nice fit.

You?

Yes, I have an ulterior motive. Having topic maps underlying one or more winners or even runners-up in this contest would promote topic maps and gain needed visibility.

I first saw this at: $10k prizes up for grabs in Netflix cloud contest by Elliot Bentley.

March 14, 2013

DRM/WWW, Wealth/Salvation: Theological Parallels

Filed under: DRM,RDF,Semantic Web — Patrick Durusau @ 7:38 pm

Cory Doctorow misses a teaching moment in his: What I wish Tim Berners-Lee understood about DRM.

Cory says:

Whenever Berners-Lee tells the story of the Web’s inception, he stresses that he was able to invent the Web without getting any permission. He uses this as a parable to explain the importance of an open and neutral Internet.

The “…without getting any permission” was a principle for Tim Berners-Lee when he was inventing the Web.

A principle then, not now.

Evidence? The fundamentals of RDF have been mired in the same model for fourteen (14) years. Impeding the evolution of the “Semantic” Web. Whatever its merits.

Another example? HTML5 violates prior definitions of URL in order to widen the reach of HTML5. (URL Homonym Problem: A Topic Map Solution)

Same “principle” as DRM support, expanding the label of “WWW” beyond what early supporters would recognize as the WWW.

HTML5 rewriting of URL and DRM support are membership building exercises.

The teaching moment comes from early Christian history.

You may (or may not) recall the parable of the rich young ruler (Matthew 19:16-30), where a rich young man asks Jesus what he must do to be saved.

Jesus replies:

One thing you still lack. Sell all that you have and distribute to the poor, and you will have treasure in heaven; and come, follow me.

And for the first hundred or more years of Christianity, so far as can be known, that rule, divesting yourself of property, was followed.

Until Clement of Alexandria. Clement took the position that indeed the rich could retain their goods, so long as they used them charitably. (Now there’s a loophole!)

That created two paths to salvation, one for anyone foolish enough to take the Bible at its word and another for anyone who wanted to call themselves Christians, without any inconvenience or discomfort.

Following Clement of Alexandria, Tim Berners-Lee is creating two paths to the WWW.

One for people who are foolish enough to innovate and share information, the innovation model of the WWW that Cory speaks so highly of.

Another path for people (DRM crowd) who neither spin nor toil but who want to burden everyone who does.

Membership as a principle isn’t surprising considering how TBL sees himself in the mirror:

[Image: TBL as WWW Pope]

The Most Expensive Fighter Jet Ever Built, by the Numbers

Filed under: Defense,Government,News,Reporting — Patrick Durusau @ 7:35 pm

The Most Expensive Fighter Jet Ever Built, by the Numbers by Theodoric Meyer.

From the post:

Thanks to the sequester, the Defense Department is now required to cut more than $40 billion this fiscal year out of its $549 billion budget. But one program that’s unlikely to take a significant hit is the F-35 Joint Strike Fighter, despite the fact that it’s almost four times more expensive than any other Pentagon weapons program that’s in the works.

We’ve compiled some of the most headache-inducing figures, from the program’s hefty cost overruns to the billions it’s generating in revenue for Lockheed Martin.

[See the post for the numbers, which are impressive.]

While the F-35 is billions over budget and years behind schedule, the program seems to be doing better recently. A Government Accountability Office report released this week found that Lockheed has made progress in improving supply and manufacturing processes and addressing technical problems.

“We’ve made enormous progress over the last few years,” Steve O’Bryan, Lockheed’s vice president of F-35 business development, told the Washington Post.

The military’s current head of the program, Lt. Gen. Christopher Bogdan, agreed that things have improved but said Lockheed and another major contractor, Pratt & Whitney, still have a ways to go.

“I want them to take on some of the risk of this program,” Bogdan said last month in Australia, which plans to buy 100 of the planes. “I want them to invest in cost reductions. I want them to do the things that will build a better relationship. I’m not getting all that love yet.”

A story that illustrates the utility of a topic map approach to news coverage.

The story has already spanned more than a decade and language like: “[t]he military’s current head of the program…,” makes me wonder about the prior military heads of the program.

Or for that matter, it isn’t really Lockheed or Pratt & Whitney that are building (allegedly) the F-35, but identifiable teams of people within those organizations.

And those companies are paying bonuses, stock dividends, etc. during the term of the project.

No one person, or for that matter any one group of people, could chase down all the actors in a story like this one.

However, merging different investigations into distinct aspects of the story could assemble a mosaic clearer than any of its individual pieces.

Perhaps tying poor management, cost overruns, etc., to named individuals will have a greater impact than generalized stories about such practices have when the name is the DoD, Lockheed, etc.


PS: If you aren’t clinically depressed, read the GAO report.

Would you buy a plane where it isn’t known if the helmet mounted display, a critical control system, will work?

It’s like buying a car where a working engine is to-be-determined, maybe.

An F-35 topic map should start with the names, addresses and current status of everyone who signed any paperwork authorizing this project.

Leading People to Longer Queries

Filed under: Interface Research/Design,Search Behavior,Search Interface,Searching — Patrick Durusau @ 6:55 pm

Leading People to Longer Queries by Elena Agapie, Gene Golovchinsky, Pernilla Qvarfordt.

Abstract:

Although longer queries can produce better results for information seeking tasks, people tend to type short queries. We created an interface designed to encourage people to type longer queries, and evaluated it in two Mechanical Turk experiments. Results suggest that our interface manipulation may be effective for eliciting longer queries.

The researchers encouraged longer queries by varying a halo around the search box.

Not conclusive but enough evidence to ask the questions:

What does your search interface encourage?

What other ways could you encourage query construction?

How would you encourage graph queries?

I first saw this in a tweet by Gene Golovchinsky.

Black Hat USA 2013

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:29 pm

Black Hat USA 2013

Deadline for submission: April 15, 2013 (But may close earlier if enough quality submissions are received.)

Conference: July 27, 2013 – August 1, 2013.

Probably the best call for papers you will see all year:

WHAT ARE THE BLACK HAT BRIEFINGS?

The Black Hat Briefings was created to fill the need for computer security professionals to better understand the security risks to information infrastructures and computer systems. Black Hat accomplishes this by assembling a group of vendor-neutral security professionals and having them speak candidly about the problems businesses and Governments face as well as potential solutions to those problems. No gimmicks – just straight talk by people who make it their business to know the information security space. The following timeslots are available: 25, 50 and 120 minutes (please note the 120 minute timeslots are only available for workshops).

as opposed to a long list of eligible topics.

If you doubt your topic will be acceptable, then it probably isn’t.

Subject identity is a thread that runs throughout cybersecurity.

Repeating the same software flaws under different names, not catching the same software flaws in testing, and failing to repair/exploit flaws known by other identities are just some of the issue areas.

To say nothing of mapping the diverse literature on cybersecurity.

Visualizing the Topical Structure of the Medical Sciences:…

Filed under: Medical Informatics,PubMed,Self Organizing Maps (SOMs),Text Mining — Patrick Durusau @ 2:48 pm

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)

Abstract:

Background

We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.

Methodology

Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.

Conclusions

Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
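
To see the mechanics at toy scale, here is a minimal self-organizing map sketch in Python/NumPy: random stand-ins for the MeSH-term document vectors, a small neuron grid instead of the 75,000 neurons used in the study, and a plain training loop.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for MeSH-term document vectors: 500 documents, 50 term dimensions.
docs = rng.random((500, 50))

# A small SOM grid (the paper trains over 75,000 neurons; 10x10 is enough here).
rows, cols, dim = 10, 10, docs.shape[1]
weights = rng.random((rows, cols, dim))

def winner(vec):
    """Return grid coordinates of the best-matching unit for one document vector."""
    d = np.linalg.norm(weights - vec, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)

grid_y, grid_x = np.mgrid[0:rows, 0:cols]

n_iter = 2000
for t in range(n_iter):
    vec = docs[rng.integers(len(docs))]
    by, bx = winner(vec)
    # Learning rate and neighbourhood radius both decay over time.
    lr = 0.5 * (1.0 - t / n_iter)
    sigma = max(1.0, (rows / 2) * (1.0 - t / n_iter))
    dist2 = (grid_y - by) ** 2 + (grid_x - bx) ** 2
    h = np.exp(-dist2 / (2 * sigma ** 2))           # neighbourhood function
    weights += lr * h[..., None] * (vec - weights)  # pull neighbourhood toward the document

# Map each document onto the trained grid.
positions = np.array([winner(v) for v in docs])
print(positions[:5])
```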

Impressive work to say the least!

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary-MeSH terms in this study-raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

“Mixed Messages” on Cybersecurity [China ranks #12 among cyber-attackers]

Filed under: Cybersecurity,Government,Government Data,News,Reporting,Security — Patrick Durusau @ 9:39 am

Do you remember the “mixed messages” Dilbert cartoon?

[Image: Dilbert “Mixed Messages” cartoon]

Where an “honest” answer meant “mixed messages?”

I had that feeling this morning when I read Mark Rockwell’s post: German telecom company provides real-time map of Cyber attacks.

From the post:

In hopes of blunting mounting electronic assaults, a German telecommunications carrier unveiled a free online capability that shows where Cyber attacks are happening around the world in real time.

Deutsche Telekom, parent company of T-Mobile, put up what it calls its “Security dashboard” portal on March 6. The map, said the company, is based on attacks on its purpose-built network of decoy “honeypot” systems at 90 locations worldwide

Deutsche Telekom said it launched the online portal at the CeBIT telecommunications trade show in Hanover, Germany, to increase the visibility of advancing electronic threats.

“New cyber attacks on companies and institutions are found every day. Deutsche Telekom alone records up to 450,000 attacks per day on its honeypot systems and the number is rising. We need greater transparency about the threat situation. With its security radar, Deutsche Telekom is helping to achieve this,” said Thomas Kremer, board member responsible for Data Privacy, Legal Affairs and Compliance.

Which has a handy chart of the sources of attacks over the last month:

Top 15 of Source Countries (Last month)

Source of Attack Number of Attacks
Russian Federation 2,402,722
Taiwan, Province of China 907,102
Germany 780,425
Ukraine 566,531
Hungary 367,966
United States 355,341
Romania 350,948
Brazil 337,977
Italy 288,607
Australia 255,777
Argentina 185,720
China 168,146
Poland 162,235
Israel 143,943
Japan 133,908

By measured “attacks,” the geographic location of China (not the Chinese government) is #12 as an origin of cyber-attacks.

After Russia, Taiwan (Province of China), Germany, Ukraine, Hungary, United States, and others.

Just in case you missed several recent news cycles, the Chinese government was being singled out as a cyber-attacker for policy or marketing reasons that are not clear.

This service makes the specious nature of those accusations apparent, although the motivations behind the reports remain unclear.

Before you incorporate any government data or report into a topic map, you should verify the information with at least two independent sources.

#Tweets4Science

Filed under: Data,Government,Tweets — Patrick Durusau @ 9:35 am

#Tweets4Science

From the manifesto:

User generated content has experienced an explosive growth both in the diversity of available services and the volume of topics covered by the users. Content published in micro-blogging sites such as Twitter is a rich, heterogeneous, and, above all, huge sample of the daily musings of our fellow citizens across the world.

Once qualified as inane chatter, more and more researchers are turning to Twitter data to better understand our social behavior and, no doubt, that global chatter will provide a first person account of our times to future historians.

Thus, initiatives such as the one led by the Library of the US Congress to collect the entire Twitter Archive are laudable. However, as of today, no researcher has been granted access to that archive, there is no estimation on when such access would be possible and, on top of that, access would only be granted on site.

As researchers we understand the legal compromises one must reach with private sector, and we understand that it is fair that Twitter and resellers offer access to Twitter data, including historical data, for a fee (a rather large one, by the way). However, without the data provided by each of Twitter users such a business would be impossible and, hence, we believe that such data belongs to the users individually and as a group.

Includes links on how to download and donate your tweets.

The researchers appeal to altruism: aggregating your tweets with others may advance human knowledge.

I have a much more pragmatic reason:

While I trust the Library of Congress, I don’t trust their pay masters.

Not to sound paranoid but the delay in anyone accessing the Twitter data at the Library of Congress seems odd. The astronomy community was providing access to much larger data sets long before the first tweet.

So why is it taking so long?

While we are waiting on multiple versions of that story, download your tweets and donate them to this project.

Worldwide Threat Assessment…

Filed under: Cybersecurity,Intelligence,Security — Patrick Durusau @ 9:35 am

Worldwide Threat Assessment of the US Intelligence Community, Senate Select Committee on Intelligence, James R. Clapper, Director of National Intelligence, March 12, 2013.

Thought you might be interested in the cybersecurity parts, which are marketing literature material if your interests lie toward security issues.

It has tidbits like this one:

Foreign intelligence and security services have penetrated numerous computer networks of US Government, business, academic, and private sector entities. Most detected activity has targeted unclassified networks connected to the Internet, but foreign cyber actors are also targeting classified networks. Importantly, much of the nation’s critical proprietary data are on sensitive but unclassified networks; the same is true for most of our closest allies. (emphasis added)

Just curious, if you discovered your retirement funds were in your mail box, would you move them to a more secure location?

Depending on the products or services you are selling, the report may have other marketing information.

I first saw this in a tweet by Jeffrey Carr.

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Filed under: Data Structures,Hadoop,Parquet — Patrick Durusau @ 9:35 am

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop by Justin Kestelyn.

From the post:

We’d like to introduce a new columnar storage format for Hadoop called Parquet, which started as a joint project between Twitter and Cloudera engineers.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google’s Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Under heavy development so watch closely!
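
A present-day sketch of those ideas in Python, using the pyarrow bindings (which post-date this post): the codec is chosen per column on write, and a read that projects one column never decodes the others.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table standing in for a Hadoop-scale dataset.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["de", "us", "br", "de"],
    "clicks":  [10, 42, 7, 99],
})

# Columnar write, with the compression codec chosen per column.
pq.write_table(
    table,
    "events.parquet",
    compression={"user_id": "snappy", "country": "gzip", "clicks": "snappy"},
)

# Column projection: only the 'clicks' column is read and decoded.
clicks = pq.read_table("events.parquet", columns=["clicks"])
print(clicks.to_pydict())
```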

Document Mining with Overview:… [Webinar – March 15, 2013]

Filed under: News,Reporting,Text Mining — Patrick Durusau @ 9:34 am

Document Mining with Overview: A Digital Tools Tutorial

From the post:

Friday, March 15, 2013 at 2:00pm Eastern Time Enroll Now

Overview is a free tool for journalists that automatically organizes a large set of documents by topic, and displays them in an interactive visualization for exploration, tagging, and reporting. Journalists have already used it to report on FOIA document dumps, emails, leaks, archives, and social media data. In fact it will work on any set of documents that is mostly text. It integrates with DocumentCloud and can import your projects, or you can upload data directly in CSV form.

You can’t read 10,000 pages on deadline, but Overview can help you rapidly figure out which pages are the important ones — even if you’re not sure what you’re looking for.

This training event is part of a series on digital tools in partnership with the American Press Institute and The Poynter Institute, funded by the John S. and James L. Knight Foundation.

See more tools in the Digital Tools Catalog.

I have been meaning to learn more about “Overview” and this looks like a good opportunity.

March 13, 2013

Hiding in Plain Sight/Being Secure From The NSA

Filed under: Cryptography,Cybersecurity,Intelligence,Security — Patrick Durusau @ 3:15 pm

I presume that if a message can be “overheard,” electronically or otherwise, it is likely the NSA and other “fictional” groups are capturing it.

The use of encryption marks you as a possible source of interest.

You can use image-based steganography to conceal messages but that requires large file sizes and is subject to other attacks.

Professor Abdelrahman Desoky of the University of Maryland in Baltimore County, USA, suggests that messages can be hidden in plain sight by changing the wording of jokes to carry a secret message.

Desoky suggests that instead of using a humdrum text document and modifying it in a codified way to embed a secret message, correspondents could use a joke to hide their true meaning. As such, he has developed an Automatic Joke Generation Based Steganography Methodology (Jokestega) that takes advantage of recent software that can automatically write pun-type jokes using large dictionary databases. Among the automatic joke generators available are: The MIT Project, Chuck Norris Joke Generator, Jokes2000, The Joke Generator dot Com and the Online Joke Generator System (pickuplinegen).

A simple example might be to hide the code word “shaking” in the following auto-joke. The original question and answer joke is “Where do milk shakes come from?” and the correct answer would be “From nervous cows.” So far, so funny. But, the system can substitute the word “shaking” for “nervous” and still retain the humor so that the answer becomes “From shaking cows.” It loses some of its wit, but still makes sense and we are not all Bob Hopes, after all. [Hiding Secret Messages in Email Jokes]

Or if you prefer the original article abstract:

This paper presents a novel steganography methodology, namely Automatic Joke Generation Based Steganography Methodology (Jokestega), that pursues textual jokes in order to hide messages. Basically, Jokestega methodology takes advantage of recent advances in Automatic Jokes Generation (AJG) techniques to automate the generation of textual steganographic cover. In a corpus of jokes, one may judge a number of documents to be the same joke although letters, locations, and other details are different. Generally, joke and puns could be retold with totally different vocabulary, while still retaining their identities. Therefore, Jokestega pursues the common variations among jokes to conceal data. Furthermore, when someone is joking, anything may be said which legitimises the use of joke-based steganography. This makes employing textual jokes very attractive as steganographic carrier for camouflaging data. It is worth noting that Jokestega follows Nostega paradigm, which implies that joke-cover is noiseless. The validation results demonstrate the effectiveness of Jokestega. [Jokestega: automatic joke generation-based steganography methodology by Abdelrahman Desoky. International Journal of Security and Networks (IJSN), Vol. 7, No. 3, 2012]

If you are interested, other publications by Professor Desoky are listed here.

It occurs to me that topic maps offer the means to create steganography chains over public channels. The sender may know the meaning, but there can be several links in the chain of transmission that change the message while having no knowledge of its meaning, and/or that don’t represent traceable links in the chain.

With every “hop” and/or mapping of the terms to another vocabulary, the task of statistical analysis grows more difficult.

Not the equivalent of highly secure communication networks, the contents of which can be copied onto a Lady Gaga DVD, but then not everyone needs that level of security.

Some people need cheaper but more secure systems for communication.

Will devote some more thought to the outline of a topic map system for hiding content in plain sight.
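
As a toy illustration of the substitution idea (not Desoky’s Jokestega system), here is a sketch in Python that hides bits in a joke by choosing between synonym pairs. The joke, the synonym table and the encoding convention are all invented.

```python
# Each slot hides one bit: index 0 of the pair encodes 0, index 1 encodes 1.
SYNONYMS = [("nervous", "shaking"), ("come", "originate")]

TEMPLATE = "Where do milk shakes {1} from? From {0} cows."

def encode(bits):
    """Produce a cover joke whose word choices carry the bits."""
    assert len(bits) == len(SYNONYMS)
    choices = [pair[b] for pair, b in zip(SYNONYMS, bits)]
    return TEMPLATE.format(*choices)

def decode(joke):
    """Recover the bits by checking which synonym appears."""
    return [1 if pair[1] in joke else 0 for pair in SYNONYMS]

cover = encode([1, 0])
print(cover)          # "Where do milk shakes come from? From shaking cows."
print(decode(cover))  # [1, 0]
```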

squirro

Filed under: Data,Filters,Findability,Searching — Patrick Durusau @ 3:14 pm

squirro

I am not sure how “hard” the numbers are but the CRM application claims:

Up to 15% increase in revenues

66% less time wasted on finding and re-finding information

15% increase in win rates

I take this as evidence there is a market for less noisy data streams.

If filtered search can produce this kind of ROI, imagine what curated search can do.

Yes?

Aaron Swartz’s A Programmable Web: An Unfinished Work

Filed under: Semantic Web,Semantics,WWW — Patrick Durusau @ 3:04 pm

Aaron Swartz’s A Programmable Web: An Unfinished Work

Abstract:

This short work is the first draft of a book manuscript by Aaron Swartz written for the series “Synthesis Lectures on the Semantic Web” at the invitation of its editor, James Hendler. Unfortunately, the book wasn’t completed before Aaron’s death in January 2013. As a tribute, the editor and publisher are publishing the work digitally without cost.

From the author’s introduction:

” . . . we will begin by trying to understand the architecture of the Web — what it got right and, occasionally, what it got wrong, but most importantly why it is the way it is. We will learn how it allows both users and search engines to co-exist peacefully while supporting everything from photo-sharing to financial transactions.

We will continue by considering what it means to build a program on top of the Web — how to write software that both fairly serves its immediate users as well as the developers who want to build on top of it. Too often, an API is bolted on top of an existing application, as an afterthought or a completely separate piece. But, as we’ll see, when a web application is designed properly, APIs naturally grow out of it and require little effort to maintain.

Then we’ll look into what it means for your application to be not just another tool for people and software to use, but part of the ecology — a section of the programmable web. This means exposing your data to be queried and copied and integrated, even without explicit permission, into the larger software ecosystem, while protecting users’ freedom.

Finally, we’ll close with a discussion of that much-maligned phrase, ‘the Semantic Web,’ and try to understand what it would really mean.”

Table of Contents: Introduction: A Programmable Web / Building for Users: Designing URLs / Building for Search Engines: Following REST / Building for Choice: Allowing Import and Export / Building a Platform: Providing APIs / Building a Database: Queries and Dumps / Building for Freedom: Open Data, Open Source / Conclusion: A Semantic Web?

Even if you disagree with Aaron, on issues both large and small, as I do, it is a very worthwhile read.

But I will save my disagreements for another day. Enjoy the read!

SURAAK – When Search Is Not Enough [A “google” of search results, new metric]

Filed under: Health care,Medical Informatics,Searching — Patrick Durusau @ 2:21 pm

SURAAK – When Search Is Not Enough (video)

A new way to do research. SURAAK is a web application that uses natural language processing techniques to analyze big data of published healthcare articles in the area of geriatrics and senior care. See how SURAAK uses text causality to find and analyze word relationships in this and other areas of interest.

SURAAK = Semantic Understanding Research in the Automatic Acquisition of Knowledge.

NLP based system that extracts “causal” sentences.

Differences from Google (according to the video)

  • Extracts text from PDFs
  • Links concepts together building relationships found in extracted text
  • Links articles together based on shared concepts

Search demo was better than using Google but that’s not hard to do.

The “notes” that are extracted from texts are sentences.

I am uneasy about the use of sentences in isolation from the surrounding text as a “note.”

It’s clearly “doable,” but whether it is a good idea remains to be seen. Particularly since users are rating sentences/notes in isolation from the text in which they occur.
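
For a sense of what “extracting causal sentences” can mean in the simplest case, here is a crude sketch in Python that pulls out sentences containing causal connectives. SURAAK’s NLP pipeline is certainly more sophisticated; the connective list and sample text are my own.

```python
import re

CAUSAL_CUES = re.compile(
    r"\b(because|leads? to|causes?|results? in|due to|therefore)\b",
    re.IGNORECASE,
)

def causal_sentences(text):
    """Return sentences that contain at least one causal connective."""
    # Naive sentence splitting on ., ! and ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CAUSAL_CUES.search(s)]

sample = ("Regular exercise improves balance. Poor balance often leads to falls "
          "in older adults. Falls are a major concern because they cause hip fractures.")

for note in causal_sentences(sample):
    print("-", note)
```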

BTW, funded with tax dollars from the National Institutes of Health and the National Institute on Aging, to the tune of $844K.

I am still trying to track down the resulting software.

I take this as an illustration that anything over a “google” of search results (a new metric) is of interest and fundable.

A map of worldwide email traffic, created with R

Filed under: Mapping,Maps,R — Patrick Durusau @ 1:00 pm

A map of worldwide email traffic, created with R by David Smith.

The Washington Post reports that by analyzing more than 10 million emails sent through the Yahoo! Mail service in 2012, a team of researchers used the R language to create a map of countries whose citizens email each other most frequently:

[Image: Worldwide email traffic map]

Some discussion of Huntington’s Clash of Civilizations, but I have a different question:

If a map is a snapshot of a territory, couldn’t a later snapshot show changes to the same territory?

Rather than debating Huntington and his money-making but shallow view of the world and its history, why not intentionally broaden the communication network you see above?

A map, even a topic map, isn’t destiny, it’s a guide to finding a path to a new location or information.

Pan-European open data…

Filed under: EU,Geographic Data,GIS — Patrick Durusau @ 12:44 pm

Pan-European open data available online from EuroGeographics

From the post:

Data compiled from national mapping supplied by 45 European countries and territories can now be downloaded for free at http://www.eurogeographics.org/form/topographic-data-eurogeographics.

From today (8 March 2013), the 1:1 million scale topographic dataset, EuroGlobalMap will be available free of charge for any use under a new open data licence. It is produced using authoritative geo-information provided by members of EuroGeographics, the Association for European Mapping, Cadastre and Land Registry Authorities.

….

“World leaders acknowledge the need for further mainstream sustainable development at all levels, integrating economic, social and environmental aspects and recognising their inter-linkages,” she said. [EuroGeographics’ President, Ingrid Vanden Berghe]

“Geo-information is key. It provides a vital link among otherwise unconnected information and enables the use of location as the basis for searching, cross-referencing, analysing and understanding Europe-wide data.”

Geographic location is a common binding point for information.

Interesting to think about geographic steganography. Right latitude but wrong longitude, or other variations.

