Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 14, 2012

An XML-Format for Conjectures in Geometry (Work-in-Progress)

Filed under: Geometry,Indexing,Keywords,Mathematical Reasoning,Mathematics,Ontology — Patrick Durusau @ 10:33 am

An XML-Format for Conjectures in Geometry (Work-in-Progress) by Pedro Quaresma.

Abstract:

With a large number of software tools dedicated to the visualisation and/or demonstration of properties of geometric constructions and also with the emerging of repositories of geometric constructions, there is a strong need of linking them, and making them and their corpora, widely usable. A common setting for interoperable interactive geometry was already proposed, the i2g format, but, in this format, the conjectures and proofs counterparts are missing. A common format capable of linking all the tools in the field of geometry is missing. In this paper an extension of the i2g format is proposed, this extension is capable of describing not only the geometric constructions but also the geometric conjectures. The integration of this format into the Web-based GeoThms, TGTP and Web Geometry Laboratory systems is also discussed.

The author notes open questions as:

  • The xml format must be complemented with an extensive set of converters allowing the exchange of information between as many geometric tools as possible.
  • The databases queries, as in TGTP, raise the question of selecting appropriate keywords. A fine grain index and/or an appropriate geometry ontology should be addressed.
  • The i2gatp format does not address proofs. Should we try to create such a format? The GATPs produce proofs in quite different formats, maybe the construction of such unifying format it is not possible and/or desirable in this area.

The “keywords,” “fine grain index,” and “geometry ontology” questions yell “topic map” to me.

You?

PS: Converters and different formats also say “topic map,” just not as loudly to me. Your volume may vary. (YVMV)


Finding Structure in Text, Genome and Other Symbolic Sequences

Filed under: Genome,Statistics,Symbol,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 8:58 am

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP: Ted Dunning.

Or it can lead you to check an author in Citeseer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.

Which has over 600 citations, only one of which is from the author. (I could comment about a well-known self-citing ontologist but I won’t.)

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.
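For a flavor of the core technique in the 1993 paper, here is a minimal sketch (mine, not Dunning’s code) of the log-likelihood ratio (G²) statistic for a 2×2 contingency table, the kind used to score candidate collocations:

import math

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts:
    k11 = A and B together, k12 = B without A,
    k21 = A without B,      k22 = neither."""
    table = [[k11, k12], [k21, k22]]
    n = float(k11 + k12 + k21 + k22)
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            if observed > 0:
                g += observed * math.log(observed / expected)
    return 2.0 * g

# Toy example: a bigram that occurs 12 times in a 100,000-token corpus
print(g2(12, 988, 588, 98412))

The higher the score, the less the two words look independent, and unlike chi-square it behaves sensibly for the rare events the abstract mentions.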

July 13, 2012

Broccoli: Semantic Full-Text Search at your Fingertips

Filed under: Broccoli,Ontology,Search Engines,Search Interface,Searching,Semantic Search — Patrick Durusau @ 5:51 pm

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, and Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g. edible), classes (e.g. plants), instances (e.g. Broccoli), and relations (e.g. occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully-functional prototype based on our ideas, see this http URL
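To picture the query model, here is a toy rendering in Python of a tree whose nodes are bags of words/classes/instances and whose arcs are relations (my own sketch, not Broccoli’s query language or internals):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QueryNode:
    # a "bag" of ordinary words, classes, and instances
    items: List[str]
    # arcs to child nodes, labelled with a relation
    children: List[Tuple[str, "QueryNode"]] = field(default_factory=list)

# roughly: plants that occur with "edible" and are native to Europe
query = QueryNode(
    items=["class:plant"],
    children=[
        ("occurs-with", QueryNode(items=["word:edible"])),
        ("native-to", QueryNode(items=["instance:Europe"])),
    ],
)
print(query)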

It’s good to see CS projects work so hard to find unambiguous names that won’t be confused with far more common uses of the same names. 😉

For all that, on quick review it does look like a clever, if annoyingly named, project.

Hmmm, it doesn’t like the “-” (hyphen) character: “graph-theoretical tree” returns 0 results, “graph theoretical tree” returns 1 (the expected one).

Definitely worth a close read.

One puzzle though. There are a number of projects that use Wikipedia data dumps. The problem is most of the documents I am interested in searching aren’t in Wikipedia data dumps. Like the Enron emails.

Techniques that work well with clean data may work less well with documents composed of the vagaries of human communication. Or attempts at communication.

Search Algorithms for Conceptual Graph Databases

Filed under: Algorithms,Graph Databases,Search Algorithms — Patrick Durusau @ 5:10 pm

Search Algorithms for Conceptual Graph Databases by Abdurashid Mamadolimov.

Abstract:

We consider a database composed of a set of conceptual graphs. Using conceptual graphs and graph homomorphism it is possible to build a basic query-answering mechanism based on semantic search. Graph homomorphism defines a partial order over conceptual graphs. Since graph homomorphism checking is an NP-Complete problem, the main requirement for database organizing and managing algorithms is to reduce the number of homomorphism checks. Searching is a basic operation for database manipulating problems. We consider the problem of searching for an element in a partially ordered set. The goal is to minimize the number of queries required to find a target element in the worst case. First we analyse conceptual graph database operations. Then we propose a new algorithm for a subclass of lattices. Finally, we suggest a parallel search algorithm for a general poset.

While I have no objection to efficient solutions for particular cases, as a general rule those solutions are valid for some known set of cases.

Here we appear to have an efficient solution for some unknown number of cases. I mention it as something to keep in mind while watching the search literature on graph databases develop.

HBase Log Splitting

Filed under: HBase — Patrick Durusau @ 4:53 pm

HBase Log Splitting by Jimmy Xiang.

When I was being certified in NetWare many years ago, the #1 reason for dismissal of sysadmins was failure to maintain backups. I don’t know what the numbers are like now, but I suspect they are about the same. In any event, you are less likely to be fired if you don’t lose data when your servers fail. That’s just a no-brainer.

If you are running HBase, take the time to review this post and make sure the job that gets lost isn’t yours.

From the post:

In the recent blog post about the HBase Write Path, we talked about the write-ahead-log (WAL), which plays an important role in preventing data loss should a HBase region server failure occur. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.

A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.

Log splitting is done by HMaster as the cluster starts or by ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, affected regions are unavailable until data is restored. So we need to recover and replay all WAL edits before letting those regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.
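To make the grouping step concrete, here is a toy sketch of log splitting in Python. The real process works on HLog files and writes per-region recovered edits; this only shows the “group by region, replay in sequence order” idea:

from collections import defaultdict

# Toy WAL: each edit records which region it belongs to and a sequence id.
wal = [
    {"region": "table,aaa,1", "seq": 1, "edit": "put row1 cf:q=v1"},
    {"region": "table,mmm,2", "seq": 2, "edit": "put row9 cf:q=v2"},
    {"region": "table,aaa,1", "seq": 3, "edit": "delete row1 cf:q"},
]

def split_log(entries):
    """Group WAL edits by region, keeping sequence order within each region."""
    per_region = defaultdict(list)
    for entry in entries:
        per_region[entry["region"]].append(entry)
    for edits in per_region.values():
        edits.sort(key=lambda e: e["seq"])
    return per_region

for region, edits in split_log(wal).items():
    print(region, [e["edit"] for e in edits])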

Titan Stress Poster [Government Comparison Shopping?]

Filed under: Amazon Web Services AWS,Titan — Patrick Durusau @ 4:45 pm

Titan Stress Poster from Marko A. Rodriguez.

Notice of a poster at GraphLab 2012 with Matthias Broecheler:

This poster presents an overview of Titan along with some excellent stress testing done by Matthias and Dan LaRoque. The stress test uses a 6-machine Titan cluster with 14 read/write servers slamming Titan with various read/writes. The results are presented in terms of the number of bytes being read/written from disk, the average runtime of the queries, the cost of a transaction on Amazon EC2, and a speculation about the number of users concurrently interacting.

Being a poster, you will have to pump up the size for legibility, but I think you will like it.

Impressive numbers. Including the Amazon EC2 cost.

Makes me wonder when governments are going to start requiring cost comparisons for system bids versus use of Amazon EC2.

Hadoop: A Powerful Weapon for Retailers

Filed under: Analytics,Data Science,Hadoop — Patrick Durusau @ 4:15 pm

Hadoop: A Powerful Weapon for Retailers

From the post:

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Interesting but I disagree with “…but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage.”

That can be a difficulty, if you are not technically capable of effectively using information from different channels.

But there is a more fundamental difficulty. Having the capacity to use multiple channels of information is no guarantee of effective use of those channels of information.

You could buy your programming department a Cray supercomputer but that doesn’t mean they can make good use of it.

The same is true for collecting “big data” or having the software to process it.

The real difficulty is the shortage of analytical skills to explore and exploit data. Computers and software can enhance but not create those analytical skills.

Analytical skills are powerful weapons for retailers.

ggplot2

Filed under: Ggplot2,Graphics,R,Visualization — Patrick Durusau @ 4:01 pm

ggplot2

From the webpage:

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

I have a few posts about ggplot2 but this site is the mother ship of information on it. Use other resources as necessary but this looks like the canonical source. (Plus you can download a local copy for your laptop, for the odd occasion when you are off the net.)

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Filed under: e-Discovery,Email,Law,Prediction,Predictive Analytics — Patrick Durusau @ 3:47 pm

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: its hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While it is true that machines can rapidly compare documents without tiring, they will miss an executive referring to a secretary as his “cupcake,” a reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

Filed under: Email,Law,Prediction,Predictive Analytics — Patrick Durusau @ 3:22 pm

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.

The start of a series of posts on predictive coding and searching of the Enron emails by a lawyer. A legal perspective is important enough that I will be posting a note about each post in this series as they occur.

A couple of preliminary notes:

I am sure this is the first time that Ralph has used predictive coding with the Enron emails. On the other hand, I would not take “…this is the first time for X…” sort of claims from any vendor or service organization. 😉

You can see other examples of processing the Enron emails at:

And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.

I wonder if that is because we are naturally nosey?

From the post:

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.

There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).

A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write it up a narrative like this. I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

Feds Look to Fight Leaks With ‘Fog of Disinformation’

Filed under: Humor,Security — Patrick Durusau @ 8:18 am

Feds Look to Fight Leaks With ‘Fog of Disinformation’

From the Wired Story:

Pentagon-funded researchers have come up with a new plan for busting leakers: Spot them by how they search, and then entice the secret-spillers with decoy documents that will give them away.

Computer scientists call it “Fog Computing” — a play on today’s cloud computing craze. And in a recent paper for Darpa, the Pentagon’s premiere research arm, researchers say they’ve built “a prototype for automatically generating and distributing believable misinformation … and then tracking access and attempted misuse of it. We call this ‘disinformation technology.’”

Two small problems: Some of the researchers’ techniques are barely distinguishable from spammers’ tricks. And they could wind up undermining trust among the nation’s secret-keepers, rather than restoring it.

There is a third problem as well: What about lobbyists, members of Congress, to say nothing of the Executive Branch who develop and lobby for policies based on information in decoy documents? No unauthorized disclosure but wasted effort based on bogus information. As distinguished from wasted effort on non-bogus information.

After the assassination of Osama bin Laden, there was an agreement among an identifiable group of executive branch officials on no detailed leaks. Next day, detailed leaks. Don’t need disinformation to know where to start rendering suspects on that one.

If they are serious about tracking leaks, whether to encourage (one department trying to discredit another) or discourage them (unlikely other than to avoid bad press/transparency), may I suggest using a topic map? Best way to follow unstructured information trails.

On the other side, to be fair, people leaking or using leaked information can use topic maps to avoid over-use of particular sources or information that can only be tracked to particular sources. Or intentionally developing information to identify (falsely) particular administration officials as the sources of information.

Parsing the Newick format in C using flex and bison

Filed under: Bioinformatics,Graphs,Trees — Patrick Durusau @ 5:47 am

Parsing the Newick format in C using flex and bison by Pierre Lindenbaum.

From the post:

The following post is my answer for this question on biostar “Newick 2 Json converter“.

The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file.

The original question asked for a parser based on perl but here, I’ve implemented a C parser using flex/bison.

If that doesn’t grab your interest, consider the following from the Wikipedia article cited by Pierre on the Newick tree format:

In mathematics, Newick tree format (or Newick notation or New Hampshire tree format) is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick’s restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein’s PHYLIP package.[1]

Of interest both for conversion and for the representation of graph-theoretical trees. It dates from about the same time as GML and other efforts on trees.
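If C, flex and bison are more than you need, a Newick string is small enough to parse with a few lines of recursive descent. A minimal Python sketch (no quoting, comments or other corner cases handled; Biopython’s Bio.Phylo is the robust choice):

def parse_newick(s):
    """Parse a simple Newick string, e.g. '(A:0.1,(B:0.2,C:0.3):0.4);',
    into nested dicts of the form {name, length, children}."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ')'
        start = pos
        while pos < len(s) and s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

print(parse_newick("(A:0.1,(B:0.2,C:0.3):0.4);"))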

In case you are in Dover, Newick’s survives to this day. I don’t know if they are aware of the reason for their fame but you could mention it.

July 12, 2012

LSU Researchers Create Topic Map of Oil Spill Disaster

Filed under: Marketing,Topic Map Software,Topic Maps — Patrick Durusau @ 6:53 pm

LSU Researchers Create Topic Map of Oil Spill Disaster

From the post:

The Gulf of Mexico Deepwater Horizon Oil Spill incident has impacted many aspects of the coastal environment and inhabitants of surrounding states. However, government officials, Gulf-based researchers, journalists and members of the general public who want a big picture of the impact on local ecosystems and communities are currently limited by discipline-specific and fractured information on the various aspects of the incident and its impacts.

To solve this problem, Assistant Professor in the School of Library and Information Science Yejun Wu is leading the way in information convergence on oil spill events. Wu’s lab has created a first edition of an online topic map, available at http://topicmap.lsu.edu/, that brings together information from a wide range of research fields including biological science, chemistry, coastal and environmental science, engineering, political science, mass communication studies and many other disciplines in order to promote collaboration and big picture understanding of technological disasters.

“Researchers, journalists, politicians and even school teachers wanted to know the impacts of the Deepwater Horizon oil spill incident,” Wu said. “I felt this was an opportunity to develop a tool for supporting learning and knowledge discovery. Our topic map tool can help people learn from historical events to better prepare for the future.”

Wu started the project with a firm belief in the need for an oil spill information hub.

“There is a whole list of historical oil spill events that we probably neglected – we did not learn enough from history,” Wu said.

He first looked to domain experts from various disciplines to share their own views of the impacts of the Deepwater Horizon oil spill. From there, Wu and his research associate and graduate students manually collected more than 7,000 concepts and 4,000 concept associations related to oil spill incidents worldwide from peer-reviewed journal articles and authoritative government websites, loading the information into an organizational topic map software program. Prior to these efforts by Wu’s lab, no comprehensive oil spill topic map or taxonomy existed.

“Domain experts typically focus on oil spill research in their own area, such as chemistry or political communication, but an oil spill is a comprehensive problem, and studies should be interdisciplinary,” Wu said. “Experts in different fields that usually don’t talk to each other can benefit from a tool that brings together and organizes information concepts across many disciplines.”

Wikipedia calls it: Deepwater Horizon oil spill. I think BP Oil Spill is a better name.

Just thinking of environmental disasters, which ones would you suggest for topic maps?

grammar why ! matters

Filed under: Grammar,Language — Patrick Durusau @ 6:41 pm

grammar why ! matters

Bob Carpenter has a good rant on grammar.

The test I would urge everyone to use before buying software or even software services is to ask to see their documentation.

Give it to one of your technical experts and ask them to turn to any page and start reading.

If at any point your expert asks what was meant, thank the vendor for their time and show them the door.

It will save you time and expense in the long run to use only software with good documentation. (It would be nice to have software that doesn’t crash often too but I would not ask for the impossible.)

OpenCalais

Filed under: Annotation,OpenCalais — Patrick Durusau @ 6:21 pm

OpenCalais

From the introduction page:

The free OpenCalais service and open API is the fastest way to tag the people, places, facts and events in your content.  It can help you improve your SEO, increase your reader engagement, create search-engine-friendly ‘topic hubs’ and streamline content operations – saving you time and money.

OpenCalais is free to use in both commercial and non-commercial settings, but can only be used on public content (don’t run your confidential or competitive company information through it!). OpenCalais does not keep a copy of your content, but it does keep a copy of the metadata it extracts there from.

To repeat, OpenCalais is not a private service, and there is no secure, enterprise version that you can buy to operate behind a firewall. It is your responsibility to police the content that you submit, so make sure you are comfortable with our Terms of Service (TOS) before you jump in.

You can process up to 50,000 documents per day (blog posts, news stories, Web pages, etc.) free of charge.  If you need to process more than that – say you are an aggregator or a media monitoring service – then see this page to learn about Calais Professional. We offer a very affordable license.

OpenCalais’ early adopters include CBS Interactive / CNET, Huffington Post, Slate, Al Jazeera, The New Republic, The White House and more. Already more than 30,000 developers have signed up, and more than 50 publishers and 75 entrepreneurs are using the free service to help build their businesses.

You can read about the pioneering work of these publishers, entrepreneurs and developers here.

To get started, scroll to the bottom section of this page. To build OpenCalais into an existing site or publishing platform (CMS), you will need to work with your developers. 

I thought I had written about OpenCalais but it turns out it was just in quotes in other posts. Should know better than to rely on my memory. 😉

The 50,000 document per day limit sounds reasonable to me and should be enough for some interesting experiments. Perhaps even comparisons of the results from different tagging projects.

Not to say one is better than another but to identify spots on semantic margins where ambiguity may be found.

Historical documents should make interesting test subjects.

A caution: the further back in history we reach, the less meaningful it is to say a word has a “correct” meaning. An author used it with a particular meaning, but that meaning passed from our ken with the passing of the author and their linguistic community. We can guess what may have been meant, but nothing more.
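If you want to run such experiments, something along these lines should work with Python’s requests library. Treat the endpoint and header names below as assumptions to verify against the OpenCalais documentation, and supply your own license key:

import requests

API_KEY = "your-opencalais-license-id"                 # placeholder
ENDPOINT = "http://api.opencalais.com/tag/rs/enrich"   # verify against the docs

def tag_text(text):
    """Submit plain text and return the extracted metadata as JSON."""
    response = requests.post(
        ENDPOINT,
        data=text.encode("utf-8"),
        headers={
            "x-calais-licenseID": API_KEY,  # assumed header name
            "Content-Type": "text/raw",
            "Accept": "application/json",
        },
    )
    response.raise_for_status()
    return response.json()

print(list(tag_text("Enron's Houston offices were raided by the FBI.").keys()))

And remember the terms of service: public content only.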

Semantator: annotating clinical narratives with semantic web ontologies

Filed under: Annotation,Ontology,Protégé,RDF,Semantator,Semantic Web — Patrick Durusau @ 2:40 pm

Semantator: annotating clinical narratives with semantic web ontologies by Dezhao Song, Christopher G. Chute, and Cui Tao. (AMIA Summits Transl Sci Proc. 2012;2012:20-9. Epub 2012 Mar 19.)

Abstract:

To facilitate clinical research, clinical data needs to be stored in a machine processable and understandable way. Manual annotating clinical data is time consuming. Automatic approaches (e.g., Natural Language Processing systems) have been adopted to convert such data into structured formats; however, the quality of such automatically extracted data may not always be satisfying. In this paper, we propose Semantator, a semi-automatic tool for document annotation with Semantic Web ontologies. With a loaded free text document and an ontology, Semantator supports the creation/deletion of ontology instances for any document fragment, linking/disconnecting instances with the properties in the ontology, and also enables automatic annotation by connecting to the NCBO annotator and cTAKES. By representing annotations in Semantic Web standards, Semantator supports reasoning based upon the underlying semantics of the owl:disjointWith and owl:equivalentClass predicates. We present discussions based on user experiences of using Semantator.

If you are an AMIA member, see above for the paper. If not, see: Semantator: annotating clinical narratives with semantic web ontologies (PDF file). And the software/webpage: Semantator.

The software is a plugin for Protégé 4.1 or higher.

Looking at the extensive screen shots at the website, which has good documentation, the first question I would ask a potential user is: “Are you comfortable with Protege?” If they aren’t I suspect you are going to invest a lot of time in teaching them ontologies and Protege. Just an FYI.

Complex authoring tools, particularly for newbies, seem like a non-starter to me. For example, why not have a standalone entity extractor (but don’t call it that, call it “I See You” (ISY)) that uses a preloaded entity file to recognize entities in a text? Where there is uncertainty, those entities are displayed in a different color, with drop-down options for possible other entities. Users get to pick one from the list (no write-in ballots). That is a step towards getting clean data for a second round with another one-trick-pony tool. Users contribute, we all benefit.
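A toy sketch of what I have in mind, with a made-up entity file and made-up names:

entity_file = {
    "aspirin": ["drug:aspirin"],
    "ms":      ["disease:multiple sclerosis", "title:Ms.", "org:Microsoft"],
    "cold":    ["disease:common cold", "weather:cold"],
}

def extract(text):
    """Scan text against the preloaded entity file; flag ambiguous hits."""
    hits = []
    for token in text.lower().replace(".", " ").split():
        candidates = entity_file.get(token)
        if not candidates:
            continue
        if len(candidates) == 1:
            hits.append((token, candidates[0], "auto"))
        else:
            # ambiguous: display in a different color with a drop-down
            # list of candidates for the user to pick from
            hits.append((token, candidates, "needs-review"))
    return hits

for hit in extract("The patient took aspirin for a cold. MS was ruled out."):
    print(hit)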

Which brings me to the common shortfall of annotation solutions: the requirement that the text to be annotated be in plain text.

There are a lot of “text” documents, but what of those in Word, PDF, Postscript, PPT, Excel, to say nothing of other formats?

The past will not disappear for want of a robust annotation solution.

Nor should it.

Real-time Twitter heat map with MongoDB

Filed under: Mapping,Maps,MongoDB,Tweets — Patrick Durusau @ 1:54 pm

Real-time Twitter heat map with MongoDB

From the post:

Over the last few weeks I got in touch with the fascinating field of data visualisation which offers great ways to play around with the perception of information.

In a more formal approach, data visualisation denotes “The representation and presentation of data that exploits our visual perception abilities in order to amplify cognition.”

Nowadays there is a huge flood of information that hits us every day. Enormous amounts of data collected from various sources are freely available on the internet. One of these data gargoyles is Twitter, producing around 400 million (400 000 000!) tweets per day!

Tweets basically offer two “layers” of information. The obvious direct information within the text of the Tweet itself and also a second layer that is not directly perceived which is the Tweets’ metadata. In this case Twitter offers a large number of additional information like user data, retweet count, hashtags, etc. This metadata can be leveraged to experience data from Twitter in a lot of exciting new ways!

So as a little weekend project I have decided to build a small piece of software that generates real-time heat maps of certain keywords from Twitter data.

Yes, “…in a lot of exciting new ways!” +1!

What about maintenance issues on such a heat map? The capture of terms to the map is fairly obvious, but a subsequent user may be left in the dark as to why this term was chosen and not some other term, or some then-current synonym for the term being captured.

Or imposing semantics on tweets or terms that are unexpected or non-obvious to a casual or not so casual observer.

You and I can agree red means go and green means stop in a tweet. That’s difficult to maintain as the number of participants and terms goes up.

A great starting place to experiment with topic maps to address such issues.
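As a starting point for such experiments, a minimal sketch of the binning step with pymongo (the collection and field names are my assumptions, not the author’s code):

from collections import Counter
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
tweets = client.twitter_heatmap.tweets   # hypothetical database/collection

def heat_bins(keyword, cell=0.01):
    """Count geotagged tweets mentioning `keyword` per lat/lon grid cell."""
    counts = Counter()
    cursor = tweets.find({"text": {"$regex": keyword, "$options": "i"},
                          "coordinates": {"$ne": None}})
    for t in cursor:
        lon, lat = t["coordinates"]["coordinates"]  # GeoJSON order is (lon, lat)
        counts[(round(lat / cell) * cell, round(lon / cell) * cell)] += 1
    return counts

for (lat, lon), n in heat_bins("coffee").most_common(10):
    print(lat, lon, n)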

I first saw this in the NoSQL Weekly Newsletter.

July 11, 2012

Introducing Galaxy, a novel in-memory data grid by Parallel Universe

Filed under: Distributed RAM,Galaxy,Memory — Patrick Durusau @ 2:28 pm

Introducing Galaxy, a novel in-memory data grid by Parallel Universe

Let me jump to the cool part:

Galaxy is a distributed RAM. It is not a key-value store. Rather, it is meant to be used as a infrastructure for building distributed data-structures. In fact, there is no way to query objects stored on Galaxy at all. Instead, Galaxy generates an ID for each item, that you can store in other items just like you’d store a normal reference in a plain object graph.

The application runs on all Galaxy nodes alongside with the portion of the data that is kept (in RAM) at each of the nodes, and when it wishes to read or write a data item, it requests the Galaxy API to fetch it.

At any given time an item is owned by exactly one node, but can be shared by many. Sharers store the item locally, but they can only read it. However, they remember who the owner is, and the owner maintains a list of all sharers. If a sharer (or any node) wants to update the item (a “write”) it requests the current owner for a transfer of ownership, and then receives the item and the list of sharers. Before modifying the item, it invalidates all sharers to ensure consistency. Even when the sharers are invalidated, they remember who the new owner is, so if they’d like to share or own the item again, they can request it from the new owner. If the application requests an item the local node has never seen (or it’s been migrated again after it had been validated), the node multicasts the entire cluster in search of it.

The idea is that when data access is predictable, expensive operations like item migration and a clueless lookup are rare, and more than offset by the common zero-I/O case. In addition, Galaxy uses some nifty hacks to eschew many of the I/O delays even in worst-case scenarios.

In the coming weeks I will post here the exact details of Galaxy’s inner-workings. What messages are transferred, how Galaxy deals with failures, and what tricks it employs to reduce latencies. In the meantime, I encourage you to read Galaxy’s documentation and take it for a spin.
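To make the single-owner/many-sharers protocol concrete, here is a toy single-process model (nothing like Galaxy’s real messaging, just the bookkeeping described above):

class Item:
    def __init__(self, owner, value):
        self.owner = owner        # exactly one owner at any time
        self.sharers = set()      # nodes holding read-only copies
        self.value = value

class Node:
    def __init__(self, name):
        self.name = name
        self.cache = {}           # local read-only copies

    def read(self, item):
        item.sharers.add(self)
        self.cache[id(item)] = item.value
        return item.value

    def write(self, item, value):
        # request ownership, invalidate all sharers, then modify
        for sharer in item.sharers:
            sharer.cache.pop(id(item), None)
        item.sharers.clear()
        item.owner = self
        item.value = value

a, b = Node("a"), Node("b")
item = Item(owner=a, value=42)
b.read(item)         # b becomes a sharer
b.write(item, 43)    # ownership transfers to b, sharers invalidated
print(item.owner.name, item.value)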

May not fit your use case but like the man says, “take it for a spin.”

Jack Park sent this to my attention.

Importing public data with SAS instructions into R

Filed under: Data,Government Data,Parsing,Public Data,R — Patrick Durusau @ 2:28 pm

Importing public data with SAS instructions into R by David Smith.

From the post:

Many public agencies release data in a fixed-format ASCII (FWF) format. But with the data all packed together without separators, you need a “data dictionary” defining the column widths (and metadata about the variables) to make sense of them. Unfortunately, many agencies make such information available only as a SAS script, with the column information embedded in a PROC IMPORT statement.

David reports on the SAScii package from Anthony Damico.

You still have to parse the files but it gets you one step closer to having useful information.
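The same idea is available in Python, for comparison (this uses pandas’ read_fwf, not the SAScii package; the file name, widths and column names below are made up):

import pandas as pd

df = pd.read_fwf(
    "agency_release.dat",        # hypothetical fixed-width file
    widths=[4, 2, 5, 30],        # column widths from the data dictionary
    names=["year", "state", "fips", "agency_name"],
)
print(df.head())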

GraphPack

Filed under: Graph Traversal,GraphPack,Graphs,Networks,Traversal — Patrick Durusau @ 2:27 pm

GraphPack

From the webpage:

GraphPack is a network of autonomous services that manage graph structures. Each node in those graphs may refer to a node in another service, effectively forming a distributed graph. GraphPack deals with the processing of such decentralized graphs. GraphPack supports its own traverse/query language (inspired by neo4j::cypher) that can be executed as transparently distributed traverses.

Amit Portnoy wrote about GraphPack on the neo4j mailing list:

The prototype, called GraphPack, has a very lite design; actual persistence and communication aspects can be easily plugged in by injection (using Guice).

GraphPack enables transparently distributed traverses (in a decentralized graph), which can be specified by a Cypher-inspired traverse specification language.

That is, clients of a GraphPack service (which may have a graph that refers to nodes in other GraphPack services) can write cypher-like expressions and simply receive a result, while the actual implementation may make many remote communication steps. This is done by deriving a new traverse specification at every edge along the specified paths and sending this derived specification to the next node (in other words, the computation moves along the nodes that are matched by the traverse specification).
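A toy simulation of “the computation moves along the nodes”: each service owns a few nodes, and a traverse specification (here just a list of edge labels) is forwarded and shrunk at every hop. Entirely made up, only to illustrate the description above:

# each service owns some nodes; edges point to (service, node) pairs
services = {
    "svc-a": {"n1": {"knows": ("svc-b", "n2")}, "n3": {}},
    "svc-b": {"n2": {"works_at": ("svc-a", "n3")}},
}

def traverse(service, node, spec):
    """Follow the remaining edge labels in `spec`, hopping between services."""
    if not spec:
        return (service, node)
    label, rest = spec[0], spec[1:]
    next_service, next_node = services[service][node][label]
    # in GraphPack this step would be a remote call to next_service
    return traverse(next_service, next_node, rest)

print(traverse("svc-a", "n1", ["knows", "works_at"]))   # ('svc-a', 'n3')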

Sounds like there should be lessons here for distributed topic maps. Yes?

Robustness Elasticity in Complex Networks

Filed under: Complex Networks,Graphs,Networks — Patrick Durusau @ 2:27 pm

Robustness Elasticity in Complex Networks by Timothy C. Matisziw, Tony H. Grubesic, and Junyu Guo. (Matisziw TC, Grubesic TH, Guo J (2012) Robustness Elasticity in Complex Networks. PLoS ONE 7(7): e39788. doi:10.1371/journal.pone.0039788)

Abstract:

Network robustness refers to a network’s resilience to stress or damage. Given that most networks are inherently dynamic, with changing topology, loads, and operational states, their robustness is also likely subject to change. However, in most analyses of network structure, it is assumed that interaction among nodes has no effect on robustness. To investigate the hypothesis that network robustness is not sensitive or elastic to the level of interaction (or flow) among network nodes, this paper explores the impacts of network disruption, namely arc deletion, over a temporal sequence of observed nodal interactions for a large Internet backbone system. In particular, a mathematical programming approach is used to identify exact bounds on robustness to arc deletion for each epoch of nodal interaction. Elasticity of the identified bounds relative to the magnitude of arc deletion is assessed. Results indicate that system robustness can be highly elastic to spatial and temporal variations in nodal interactions within complex systems. Further, the presence of this elasticity provides evidence that a failure to account for nodal interaction can confound characterizations of complex networked systems.

As you might expect, I am reading this paper from the perspective of connections between nodes of information and not, for example, nodes on the Internet. I suspect it is particularly relevant for models of information sharing (or the lack thereof) between agencies.

With only anecdotal evidence about sharing, it isn’t possible to determine what structural, organizational or other barriers are blocking the flow of information. Or what changes would result in the largest amount of effective information sharing. If shared information is mired at upper management levels, then it is of little use to actual analysts.

BTW, you can rest assured this work will not be used inappropriately:

Matisziw’s model was documented in the publicly available journal PLoS ONE. Making such a powerful tool widely available won’t be a danger, Matisziw said. To use his model, a network must be understood in detail. Since terrorists and other criminals don’t have access to enough data about the networks, they won’t be able to use the model to develop doomsday scenarios. [From: Cyberwarfare, Conservation and Disease Prevention Could Benefit from New Network Model, where I first saw this story.]

Do terrorists develop anything other than “doomsday scenarios?” Or is that just PR? Like Y2K or the recent DNS issue. Everyone gets worked up, small bump, life goes on.

Compressive Genomics [Compression as Merging]

Filed under: Bioinformatics,Compression,Genome,Merging,Scalability — Patrick Durusau @ 2:27 pm

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read: subject) that occurs multiple times over similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
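A toy illustration of the merging analogy: identical fragments across genomes get one entry in an index, with links back to every place they occur (nothing like the real CaBLAST/CaBLAT data structures, just the shape of the idea):

genomes = {
    "genome1": "ACGTACGTTTGACC",
    "genome2": "ACGTACGTTTGAGG",
    "genome3": "TTGACCACGTACGT",
}

def build_index(genomes, k=8):
    """Map each k-mer to every (genome, offset) where it occurs."""
    index = {}
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((name, i))
    return index

# a fragment shared by several genomes has a single representative entry
for kmer, places in build_index(genomes).items():
    if len(places) > 1:
        print(kmer, places)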

What makes merging computationally tractable here, when at least some topic map systems are reported to have scalability issues? See Scalability of Topic Map Systems by Marcel Hoyer.

What other examples of computationally tractable merging would you suggest, including different merging approaches/algorithms? It might make a useful paper/study to work from scalable merging examples towards less scalable ones, perhaps to discover what choices have an impact on scalability.

Learning From Data [Machine Learning]

Filed under: Machine Learning — Patrick Durusau @ 2:26 pm

Learning from Data lectures by Professor Yaser Abu-Mostafa.

Just the main topics; each is composed of sub-topics on the lecture page (above):

  • Bayesian Learning
  • Bias-Variance Tradeoff
  • Bin Model
  • Data Snooping
  • Ensemble Learning
  • Error Measures
  • Gradient Descent
  • Learning Curves
  • Learning Diagram
  • Learning Paradigms
  • Linear Classification
  • Linear Regression
  • Logistic Regression
  • Netflix Competition
  • Neural Networks
  • Nonlinear Transformation
  • Occam’s Razor
  • Overfitting
  • Radial Basis Functions
  • Regularization
  • Sampling Bias
  • Support Vector Machines
  • Validation
  • VC Dimension

The textbook by the same title: Learning from Data.

The lectures look like a good place to get lost. For days!
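For a small taste of two of the topics (linear regression fitted by gradient descent) before committing to the lectures, a self-contained sketch:

import random

# synthetic data: y = 3x + 2 plus a little noise
data = [(i / 100, 3 * (i / 100) + 2 + random.gauss(0, 0.1)) for i in range(100)]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should land near 3 and 2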

Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Filed under: ElasticSearch,Pig,Wonderdog — Patrick Durusau @ 2:26 pm

Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Russell Jurney continues his posts on searching at scale:

Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.

ElasticSearch makes search simple. ElasticSearch is built over Lucene and provides a simple but rich JSON over HTTP query interface to search clusters of one or one hundred machines. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr Recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.

Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available in Avro format here: https://s3.amazonaws.com/rjurney.public/enron.avro. Lets check them out:

Scale is important for some queries but what other factors are important for searches?

Google, for example, is searching at scale. Is that a counter-example to scale being the only measure of search success? Or the best measure?

Or is scale of searching just a starting point?

Where do you go after scale? Scale is easy to evaluate/measure, so whatever your next step, how is it evaluated or measured?

Or is that the reason for the emphasis on scale/size? It’s an easy mark (in several senses).

Twitter Languages of London

Filed under: Tweets,Visualization — Patrick Durusau @ 2:25 pm

Twitter Languages of London by James Cheshire.

From the post:

Last year Eric Fischer produced a great map (see below) visualising the language communities of Twitter. The map, perhaps unsurprisingly, closely matches the geographic extents of the world’s major linguistic groups. On seeing these broad patterns I wondered how well they applied to the international communities living in London. The graphic above shows the spatial distribution of about 470,000 geo-located tweets (collected and georeferenced by Steven Gray) grouped by the language stated in their user’s profile information*. Unsurprisingly, English is by far the most popular. More surprising, perhaps, is the very similar distributions of most of the other languages- with higher densities in central areas and a gradual spreading to the outskirts (I expected greater concentrations in particular areas of the city). Arabic (and Farsi) tweets are much more concentrated around the Hyde Park, Marble Arch and Edgware Road areas whilst the Russian tweeters tend to stick to the West End. Polish and Hungarian tweets appear the most evenly spread throughout London.

Interesting visualization of tweet locations in London and the languages of the same.

Ties in with something I need to push out this week.

On using Twitter as a public but secure intelligence channel. More on that either later today or tomorrow.

Scalatron

Filed under: Games,Programming,Scala — Patrick Durusau @ 2:25 pm

Scalatron: Learn Scala with a programming game

From the homepage:

Scalatron is a free, open-source programming game in which bots, written in Scala, compete in a virtual arena for energy and survival. You can play by yourself against the computer or organize a tournament with friends. Scalatron may be the quickest and most entertaining way to become productive in Scala. – For updates, follow @scalatron on Twitter.

Entertaining and works right out of the “box.”

Well, remember the HBase 8080 port conflict issue? With that in mind, from the Scalatron documentation:

java -jar Scalatron.jar -help

Displays far more command line options than will be meaningful at first.

For the HBase 8080 issue, you need:

java -jar Scalatron.jar port int

or in my case:

java -jar Scalatron.jar port 9000

Caution: on startup it will ask to make Google Chrome your default browser. Good that it asks, but annoying. Why not leave the user with whatever default browser they already prefer?

Anyway, it starts up, asks you to create a user account (browser window) and lets you set the Administrator password.

Scalatron window opens up and I can tell this could be real addictive, in or out of ISO WG meetings. 😉

Scala resources mentioned in the Scalatron Tutorial document:

Other Resources

It’s a bit close to the metal to use as a model for a topic map “game.”

But I like the idea of “bots” (read: teams) competing against each other, except here the competition would be the construction of a topic map.

Just sketching some rough ideas, but assuming some asynchronous means of communication (say tweets, emails, or IRC chat), a simple syntax (CTM anyone?), and basic automated functions and scoring, that should be doable, even if not on a “web” scale. 😉

By “basic automated functions” I mean more than simply parsing syntax for addition to a topic map. They would include, for example, the submission of DOIs to be resolved against a vendor or library catalog, with the automatic production of additional topics, associations, etc. Repetitive entry of information by graduate students only proves they are skillful copyists.

Assuming some teams will discover the same information as others, some timing mechanism and awarding of “credit” for topics/associations/occurrences added to the map would be needed.

Not to mention the usual stuff of contests, leader board, regular updating of the map, along with graph display, etc.

Something to think about. As I tell my daughter, life is too important to be taken seriously. Perhaps the same is true about topic maps.

Forwarded by Jack Park. (Who is not responsible for my musings on the same.)

July 10, 2012

Linked Media Framework [Semantic Web vs. ROI]

Filed under: Linked Data,RDF,Semantic Web,SKOS,SPARQL — Patrick Durusau @ 11:08 am

Linked Media Framework

From the webpage:

The Linked Media Framework is an easy-to-setup server application that bundles central Semantic Web technologies to offer advanced services. The Linked Media Framework consists of LMF Core and LMF Modules.

LMF Usage Scenarios

The LMF has been designed with a number of typical use cases in mind. We currently support the following tasks out of the box:

Target groups are in particular casual users who are not experts in Semantic Web technologies but still want to publish or work with Linked Data, e.g. in the Open Government Data and Linked Enterprise Data area.

It is a bad assumption that workers in business or government have free time to add semantics to their data sets.

If adding semantics to your data, by linked data or other means, is a core value, resource the task just like any other, with your internal staff, or hire outside help.

A Semantic Web shortcoming is the attitude that users are interested in, or have the time for, building it, assuming the project to be worthwhile and/or doable.

Users are fully occupied with tasks of their own and don’t need a technical elite tossing more tasks onto them. You want the Semantic Web? Suggest you get on that right away.

Integrated data that meets a business need and has proven ROI isn’t the same thing as the Semantic Web. Give me a call if you are interested in the former, not the latter. (I would do the latter as well, but only on your dime.)

I first saw this at semanticweb.com, announcing version 2.2.0 of lmf – Linked Media Framework.

What is Linked Data

Filed under: Government Data,Linked Data,LOD — Patrick Durusau @ 10:37 am

What is Linked Data by John Goodwin.

From the post:

In the early 1990s there began to emerge a new way of using the internet to link documents together. It was called the World Wide Web. What the Web did that was fundamentally new was that it enabled people to publish documents on the internet and link them such that you could navigate from one document to another.

Part of Sir Tim Berners-Lee’s original vision of the Web was that it should also be used to publish, share and link data. This aspect of Sir Tim’s original vision has gained a lot of momentum over the last few years and has seen the emergence of the Linked Data Web.

The Linked Data Web is not just about connecting datasets, but about linking information at the level of a single statement or fact. The idea behind the Linked Data Web is to use URIs (these are like the URLs you type into your browser when going to a particular website) to identify resources such as people, places and organisations, and to then use web technology to provide some meaningful and useful information when these URIs are looked up. This ‘useful information’ can potentially be returned in a number of different encodings or formats, but the standard way for the linked data web is to use something called RDF (Resource Description Framework).

An introductory overview of the rise and use of linked data.

John is involved in efforts at data.gov.uk to provide open access to governmental data and one form of that delivery will be linked data.

You will be encountering linked data, both as a current and a legacy format, so it is worth your time to learn it now.
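If you want a first hands-on taste, the rdflib library makes the idea tangible in a few lines (the person URI below is a made-up example; assumes rdflib is installed):

from rdflib import Graph, URIRef, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
g = Graph()
person = URIRef("http://example.org/people/alice")   # made-up URI

g.add((person, FOAF.name, Literal("Alice")))
# link out to another dataset's URI for the same real-world place
g.add((person, FOAF.based_near,
       URIRef("http://dbpedia.org/resource/Southampton")))

print(g.serialize(format="turtle"))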

I first saw this at semanticweb.com.

Thinking in Datomic: Your data is not square

Filed under: Datomic,Design,NoSQL — Patrick Durusau @ 10:19 am

Thinking in Datomic: Your data is not square by Pelle Braendgaard.

From the post:

Datomic is so different than regular databases that your average developer will probably choose to ignore it. But for the developer and startup who takes the time to understand it properly I think it can be a real unfair advantage as a choice for a data layer in your application.

In this article I will deal with the core fundamental definition of how data is stored in Datomic. This is very different from all other databases so before we even deal with querying and transactions I think it’s a good idea to look at it.

Yawn, “your data is not square.” 😉 Just teasing.

But we have all heard the criticism of relational tables. I think writers can assume that much, at least in technical forums.

The lasting value of the NoSQL movement (in addition to whichever software packages survive) will be its emphasis on analysis of your data. Your data may fit perfectly well into a square but you need to decide that after looking at your data, not before.

The same can be said about the various NoSQL offerings. Your data may or may not be suited for a particular NoSQL option. The data analysis “cat being out of the bag,” it should be applied to NoSQL options as well. True, almost any option will work, but your question should be: why is option X the best option for my data/use case?

