Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

April 12, 2013

“Almost there….” (Computing Homology)

Filed under: Data Analysis,Feature Spaces,Homology,Topological Data Analysis,Topology — Patrick Durusau @ 4:03 pm

We all remember the pilot in Star Wars who kept saying, “Almost there….” Jeremy Kun has us “almost there…” in his latest installment: Computing Homology.

To give you some encouragement, Jeremy concludes the post saying:

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis. (bold emphasis added)

Data analysts are needed at all levels.

Do you want to be a spreadsheet data analyst or something a bit harder to find?
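If you want a small taste of the computation Jeremy is building toward, here is a minimal sketch in Python/NumPy (my illustration, not Jeremy’s code). It reads Betti numbers off boundary matrices using rational ranks only, so it ignores torsion, which the full algorithm handles with Smith normal form.

```python
import numpy as np

def betti_numbers(simplex_counts, boundaries):
    """Betti numbers over the rationals.

    simplex_counts[k] = number of k-simplices; boundaries[k] = matrix of the
    boundary map from (k+1)-chains to k-chains.  Uses
    beta_k = n_k - rank(d_k) - rank(d_{k+1}), with d_0 = 0.
    """
    ranks = [np.linalg.matrix_rank(b) for b in boundaries] + [0]
    betti = []
    for k, n_k in enumerate(simplex_counts):
        rank_in = ranks[k - 1] if k > 0 else 0   # rank of d_k
        betti.append(n_k - rank_in - ranks[k])   # ranks[k] = rank of d_{k+1}
    return betti

# hollow triangle: 3 vertices, 3 edges, no filled face
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])
print(betti_numbers([3, 3], [d1]))   # [1, 1]: one component, one loop
```

One component and one loop is exactly what a circle should report.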

NLTK 2.3 – Working with Wordnet

Filed under: Lisp,Natural Language Processing,NLTK,WordNet — Patrick Durusau @ 3:38 pm

NLTK 2.3 – Working with Wordnet by Vsevolod Dyomkin.

From the post:

I’m a little bit behind my schedule of implementing NLTK examples in Lisp with no posts on topic in March. It doesn’t mean that work on CL-NLP has stopped – I’ve just had an unexpected vacation and also worked on parts, related to writing programs for the excellent Natural Language Processing by Michael Collins Coursera course.

Today we’ll start looking at Chapter 2, but we’ll do it from the end, first exploring the topic of Wordnet.

Vsevolod more than makes up for his absence with his post on Wordnet.

As a sample, consider this graphic of the potential of Wordnet:

[Image: Wordnet schema]

Pay particular attention to the coverage of similarity measures.
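If you want to poke at the same material from the Python side while Vsevolod works through the Lisp, the similarity measures the chapter covers are one call away in NLTK (assuming you have run nltk.download('wordnet')):

```python
from nltk.corpus import wordnet as wn

dog, cat, car = wn.synset('dog.n.01'), wn.synset('cat.n.01'), wn.synset('car.n.01')

print(dog.path_similarity(cat), dog.path_similarity(car))  # dog is closer to cat
print(dog.wup_similarity(cat))         # Wu-Palmer: depth of the common ancestor
print(dog.lch_similarity(cat))         # Leacock-Chodorow: log-scaled path length
print(dog.lowest_common_hypernyms(cat))  # the shared ancestor synset
```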

Enjoy!

50,000 Lessons on How to Read:…

Filed under: Associations,Corpora,Natural Language Processing,Relation Extraction — Patrick Durusau @ 3:28 pm

50,000 Lessons on How to Read: a Relation Extraction Corpus by Dave Orr, Product Manager, Google Research.

From the post:

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).

The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.

To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.

Another step in the “right” direction.

This is a human-curated set of relation semantics.

Rather than trying to apply this as a universal “standard,” what if you were to create a similar data set for your domain/enterprise?

Using human curators to create and maintain a set of relation semantics?
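A minimal sketch of what that curation step could look like, in Python. The field names are hypothetical (Google’s release has its own layout); the point is just the majority-vote aggregation of curator judgments into accepted relations:

```python
from collections import Counter

judged = [  # hypothetical records: one judged relation instance each
    {"subject": "Jim Henson", "relation": "spouse",
     "object": "Jane Henson", "judgments": ["yes", "yes", "yes", "yes", "no"]},
    {"subject": "Jim Henson", "relation": "place_of_birth",
     "object": "Paris", "judgments": ["no", "no", "yes", "no", "no"]},
]

def accepted(record, threshold=0.6):
    votes = Counter(record["judgments"])
    return votes["yes"] / sum(votes.values()) >= threshold

curated = [(r["subject"], r["relation"], r["object"])
           for r in judged if accepted(r)]
print(curated)   # [('Jim Henson', 'spouse', 'Jane Henson')]
```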

Being a topic mappish sort of person, I suggest the basis for their identification of the relationship be explicit, for robust re-use.

But you can repeat the same analysis over and over again if you prefer.

Null Values in Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 1:04 pm

While researching the new labels feature in Neo4j 2.0.0-M01, I ran across the following statement in the documentation:

Note

null is not a valid property value. Nulls can instead be modeled by the absence of a key.

(Properties 3.3)

The question is asked in the comments:

Bryan Watson:
What is implied by the absence of a key (“null”)?
(1) the key is relevant but the value is unknown? (spouse of a customer) […]

and answered:

Andrés Taylor (replying to Bryan Watson):
In Neo4j, the absence of a key can mean all three options. Is that problematic in a particular concrete case, or are you wondering in the general case?

and,

Andrés Taylor (replying to Bryan Watson):
You are correct. Neo4j is an unstructured database, which moves some of the responsibilities to the application. This is one of the things that the application has to take care of.

This is problematic for topic maps because a role in an association may be known while the player of that role is unknown.

Moreover, what happens if there are multiple “nulls?”

An application could have a “schema” for a node type that makes it possible to spot missing keys, but that seems like a long way to go for a “null.”
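A minimal sketch of that application-side bookkeeping, in Python rather than Cypher, just to show what “moving the responsibility to the application” amounts to: you end up inventing your own way to tell the flavors of “null” apart.

```python
UNKNOWN = object()          # key is relevant, value not known (spouse of a customer)
NOT_APPLICABLE = object()   # key does not apply to this node at all

node_schema = {"Person": {"name", "spouse", "birthplace"}}

def missing_keys(label, properties):
    """Keys the schema expects but the node does not carry."""
    return node_schema[label] - properties.keys()

alice = {"name": "Alice", "spouse": UNKNOWN}   # explicit "don't know"
print(missing_keys("Person", alice))           # {'birthplace'} -- silently absent
```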

I don’t find “unstructured database” to be a persuasive argument for moving responsibilities to an application.

Databases, unstructured or otherwise, should be able to deal robustly with the various cases of “null.”

April 11, 2013

Glass – Another Topic Map Medium?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:29 pm

If you haven’t seen Glass, go to: http://www.google.com/glass/start/

If lame search results are annoying on your desktop, pad or cellphone, imagine not being able to escape them.

Or, for a positive spin, would you want a service provider with better results?

Bad data “in your face” may be the selling point we need.

Spreadsheet is Still the King of all Business Intelligence Tools

Filed under: Business Intelligence,Marketing,Spreadsheets,Topic Maps — Patrick Durusau @ 4:01 pm

Spreadsheet is Still the King of all Business Intelligence Tools by Jim King.

From the post:

The technology consulting firm Gartner Group Inc. once precisely predicted that BI would be the hottest technology in 2012, and the year 2012 did see a sharp and substantial increase in BI. Unexpectedly, the spreadsheet turned out to be the tool developed and welcomed most, ahead of SAP BusinessObjects, IBM Cognos, QlikTech QlikView, MicroStrategy, and TIBCO Spotfire. In fact, whether measured by total sales, customer base, or growth, the spreadsheet is the clear number one.

Why is the spreadsheet still ruling the BI world?

See Jim’s post for the details but the bottom line was:

It is the low technical requirement, the intuitive and flexible calculation capability, and the business-expert-oriented, easy solution to 80% of BI problems that make the spreadsheet still rule the BI world.

Question:

How do you translate:

  • low technical requirement
  • intuitive and flexible calculation capacity (or its semantic equivalent)
  • business-expert-oriented solution to 80% of BI problems

into a topic map application?

Selling Topic Maps: One Feature At A Time?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:45 pm

Dylan Jones writes in Data Quality: One Habit at a Time:

I started learning about data quality management back in 1992. Back then there were no conferences, limited publications and if you received an email via the internet the excitement lasted for hours.

Fast forward to today. We are practically swamped with data quality knowledge outlets. Sites like the Data Roundtable, OCDQ Blog and scores of other data quality bloggers provide practical ideas and techniques on an almost hourly basis.

We never lack for ideas and methods for implementing data quality management, and of course this is hugely beneficial for professionals looking to mature data quality in their organisation.

However, with all this knowledge comes a warning. Data quality management can only succeed when behaviours are changed, but to change a person’s behaviour requires the formation of new habits. This is where many projects will ultimately fail.

Have you ever started the New Year with a promise to change your ways and introduce new habits? Perhaps the guilt of festive excesses drove you to join a gym or undertake some other new health regime. How was that health drive looking in March? How about September?

The problem of habit formation is exacerbated when we attempt to change multiple habits. Perhaps we want to combine a regular running regime with learning new skills. The result is often failure.

Does your topic maps sales pitch require too much change? (I know mine does.)

Or do you focus on the one issue/problem that your client needs solving?

Sure, topic maps enable robust integration of diverse data stores, but if that’s not your client’s issue, why bring it up?

Can we sell more by promising less?

Cargo Cult Data Science [Cargo Cult Semantics?]

Filed under: Data Science,Semantics — Patrick Durusau @ 3:30 pm

Cargo Cult Data Science by Jim Harris.

From the post:

Last week, Phil Simon blogged about being wary of snake oil salesman who claim to be data scientists. In this post, I want to explore a related concept, namely being wary of thinking that you are performing data science by mimicking what data scientists do.

The American theoretical physicist Richard Feynman coined the term cargo cult science to refer to practices that have the semblance of being scientific, but do not in fact follow the scientific method.

As Feynman described his analogy, “in the South Seas there is a cult of people. During the war they saw airplanes land with lots of materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for the airplanes to land. They’re doing everything right. The form is perfect. But it doesn’t work. No airplanes land. So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.”

Feynman’s description of the runway and controller reminds me of attempts to create systems with semantic “understanding.”

We load them up with word lists, thesauri, networks of terms, the equivalent of runways.

We give them headphones (ontologies) with bars of bamboo (syntax) sticking out of them.

And after all that, semantic understanding continues to elude us.

Maybe those efforts are missing something essential? (Like us?)

PyData and More Tools…

Filed under: Data Science,PyData,Python — Patrick Durusau @ 3:14 pm

PyData and More Tools for Getting Started with Python for Data Scientists by Sean Murphy.

From the post:

It would turn out that people are very interested in learning more about python and our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So, we wanted to give back those comments and a few more in a new post. As luck would have it, John Dennison, who helped co-author this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned at the two conferences.

I make out at least seventeen (17) different Python resources, libraries, etc.

Enough to keep you busy for more than a little while. 😉

MS Machine Learning Summit [23 April 2013]

Filed under: Machine Learning,Microsoft — Patrick Durusau @ 2:29 pm

MS Machine Learning Summit

From the post:

The live broadcast of the Microsoft Research Machine Learning Summit will include keynotes from machine learning experts and enlightening discussions with leading scientific and academic researchers about approaches to challenges that are raised by the new era in machine learning. Watch it streamed live from Paris on April 23, 2013, 13:30–17:00 Greenwich Mean Time (09:30–13:00 Eastern Time, 06:30–10:00 Pacific Time) at http://MicrosoftMLS.com.

I would rather be in Paris but watching the live stream will be a lot cheaper!

Groningen Meaning Bank (GMB)

Filed under: Corpora,Corpus Linguistics,Linguistics,Semantics — Patrick Durusau @ 2:19 pm

Groningen Meaning Bank (GMB)

From the “about” page:

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB, texts are annotated rather than isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.

Method

The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE’s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.
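For readers new to DRT, here is a bare-bones sketch (mine, not the GMB’s) of the kind of object being annotated: a set of discourse referents plus conditions over them, with discourse built up by merging. The GMB itself uses Boxer’s much richer DRSs; this is only an illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DRS:
    """Toy Discourse Representation Structure: referents plus conditions."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)

    def merge(self, other):
        # discourse meaning grows by merging sentence-level DRSs
        return DRS(self.referents + other.referents,
                   self.conditions + other.conditions)

s1 = DRS(["x"], ["man(x)", "walk(x)"])   # "A man walks."
s2 = DRS([],    ["whistle(x)"])          # "He whistles." -- 'he' resolved to x
print(s1.merge(s2))                      # one structure, expressible in first-order logic
```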

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise.

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises.

Efficient comparison of sets of intervals with NC-lists

Filed under: Bioinformatics,Set Intersection,Sets — Patrick Durusau @ 1:00 pm

Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)

Abstract:

Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.

Before you search for “NC-lists,” be aware that today some popular search engines return this article as the first “hit,” followed by a variety of lists for North Carolina.

A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.

The expansion helps: Nested Containment List (NCList).

Familiar if you are working in bioinformatics.

More generally, consider the need to compare complex sequences of values for merging purposes.

Not a magic bullet but a technique you should keep in mind.

Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)
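To make the underlying task concrete, here is a plain sort-and-sweep sketch in Python. It is not the nested containment list structure from the paper, just the basic idea of comparing two sorted sets of intervals in one pass:

```python
def overlap_pairs(queries, references):
    """Report all (query, reference) pairs whose half-open intervals overlap."""
    queries = sorted(queries)
    references = sorted(references)
    active, j, result = [], 0, []
    for q_start, q_end in queries:
        # pull in references that start before this query ends
        while j < len(references) and references[j][0] < q_end:
            active.append(references[j])
            j += 1
        # drop references that ended before this query starts; since queries
        # are sorted by start, they cannot overlap any later query either
        active = [r for r in active if r[1] > q_start]
        result.extend(((q_start, q_end), r) for r in active if r[0] < q_end)
    return result

print(overlap_pairs([(1, 5), (10, 20)], [(3, 12), (15, 30), (40, 50)]))
# [((1, 5), (3, 12)), ((10, 20), (3, 12)), ((10, 20), (15, 30))]
```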

Clojure Data Analysis Cookbook

Filed under: Clojure,Data Analysis — Patrick Durusau @ 6:01 am

Clojure Data Analysis Cookbook by Eric Rochester.

I don’t have a copy of Clojure Data Analysis Cookbook but strongly suggest that you read the sample chapter before deciding to buy it.

You will find that two of its eleven chapters, Chapter 6 (Working with Incanter Datasets) and Chapter 7 (Preparing for and Performing Statistical Data Analysis with Incanter), are focused on Incanter.

The Incanter site, incanter.org, bills itself as “Incanter Data Sorcery.”

If you go to the blog tab, you will find the most recent entry is December 29, 2010.

The Twitter tab shows the most recent tweet as July 21, 2012.

The discussion tab does point to recent discussions, but since the first of the year (2013) activity has been light.

I am concerned that a March 2013 title would devote two chapters to what appears not to be a very active project.

Particularly in a rapidly moving area like data analysis.

April 10, 2013

Tim Berners-Lee Renounces XML?

Filed under: JSON,XML — Patrick Durusau @ 2:06 pm

Draft TAG Teleconference Minutes 4th of April 2013

In a discussion of ISSUE-34: XML Transformation and composability (e.g., XSLT, XInclude, Encryption) the following exchange takes place:

Noah: Lets go through the issues and see which we can close. … Processing model of XML. Is there any interest in this?

xmlFunctions-34

Tim: I’m happy to do things with XML. This came from when we’re talking about XML was processed. The meaning from XML has to be taken outside-in. Otherwise you cannot create new XML specifications that interweave with what exist. … Not clear people noticed that.

I note that tracker has several status codes we can assign, including OPEN, PENDING, REVIEW, POSTPONED, and CLOSED.

Tim: Henry did a lot more work on that. I don’t feel we need to put a whole lot of energy into XML at all. JSON is the new way for me. It’s much more straightforward.

Suggestion: if we think this is now resolved or uninteresting, CLOSE it; if we think it’s interesting but not now, then POSTPONED?

Tim: We need another concept besides OPEN/CLOSED. Something like NOT WORKING ON IT.

Noah: It has POSTPONED.

Tim: POSTPONED expresses a feeling of guilt. But there’s no guilt.

Noah: It’s close enough and I’m not looking forward to changing Tracker.

ht, you wanted to add 0.02USD

Henry: I’m happy to move this to the backburner. I think there’s a genuine issue here and of interest to the community but I don’t have the bandwidth.

Noah: We need to tell ourselves a story as to what these codes mean. … Historically we used CLOSED for “it’s in pretty good shape”.

Henry: I’m happy with POSTPONED and it’s better than CLOSED.

+1 for postponing

+1

RESOLUTION: We mark ISSUE-34 (xmlFunctions-34) POSTPONED

I think this is important, thanks for doing it noah

(emphasis added)

XML can be improved to be sure but the concept is not inherently flawed.

To JSON supporters, all I can say is that XML, when it started, wasn’t the bloated confusion you see now.

Neo4j in Action – Software Metrics [Correction]

Filed under: Graphs,Neo4j,Software — Patrick Durusau @ 1:38 pm

Neo4j in Action – Software Metrics by Michael Hunger.

Michael walks through exploring a Java class as a graph.

It makes me curious about treating code as a graph in order to discover which classes use the same data.

BTW, the tweeted location: http://www.slideshare.net/mobile/jexp/class-graph-neo4j-and-software-metrics does not appear to work in a desktop browser.

I was able to locate: http://www.slideshare.net/jexp/class-graph-neo4j-and-software-metrics, which is the link I use above.

Can Big Data From Cellphones Help Prevent Conflict? [Privacy?]

Filed under: BigData,Data Mining,Privacy — Patrick Durusau @ 10:54 am

Can Big Data From Cellphones Help Prevent Conflict? by Emmanuel Letouzé.

From the post:

Data from social media and Ushahidi-style crowdsourcing platforms have emerged as possible ways to leverage cellphones to prevent conflict. But in the world of Big Data, the amount of information generated from these is too small to use in advanced data-mining techniques and “machine-learning” techniques (where algorithms adjust themselves based on the data they receive).

But there is another way cellphones could be leveraged in conflict settings: through the various types of data passively generated every time a device is used. “Phones can know,” said Professor Alex “Sandy” Pentland, head of the Human Dynamics Laboratory and a prominent computational social scientist at MIT, in a Wall Street Journal article. He says data trails left behind by cellphone and credit card users—“digital breadcrumbs”—reflect actual behavior and can tell objective life stories, as opposed to what is found in social media data, where intents or feelings are obscured because they are “edited according to the standards of the day.”

The findings and implications of this, documented in several studies and press articles, are nothing short of mind-blowing. Take a few examples. It has been shown that it was possible to infer whether two people were talking about politics using cellphone data, with no knowledge of the actual content of their conversation. Changes in movement and communication patterns revealed in cellphone data were also found to be good predictors of getting the flu days before it was actually diagnosed, according to MIT research featured in the Wall Street Journal. Cellphone data were also used to reproduce census data, study human dynamics in slums, and for community-wide financial coping strategies in the aftermath of an earthquake or crisis.

Very interesting post on the potential uses for cell phone data.

You can imagine what I think could be correlated with cellphone data using a topic map so I won’t bother to enumerate those possibilities.

I did want to comment on the concern about privacy (or re-identification, as Emmanuel calls it in his post) from cellphone data.

Governments, who have declared they can execute any of us without notice or a hearing, are the guardians of that privacy.

That causes me to lack confidence in their guarantees.

Discussions of privacy should assume governments already have unfettered access to all data.

The useful questions become: How do we detect their misuse of such data? and How do we make them heartily sorry for that misuse?

For cell phone data, open access would give government officials more reason for pause than it gives the ordinary citizen.

Less privacy for individuals but also less privacy for access, bribery, contract padding, influence peddling, and other normal functions of government.

In the U.S.A., we have given up our rights to public trial, probable cause, habeas corpus, protections against unreasonable search and seizure, to be free from touching by strangers, and several others.

What’s the loss of the right to privacy for cellphone data compared to catching government officials abusing their offices?

R Cheatsheets

Filed under: Data Mining,R — Patrick Durusau @ 10:29 am

R Cheatsheets

I ran across this collection of cheatsheets for R today.

The R Reference Card for Data Mining is the most interesting to me, but you will want to look at some of the others.

Enjoy!

Free Data Mining Tools [African Market?]

Filed under: Data Mining,jHepWork,Knime,Mahout,Marketing,Orange,PSPP,RapidMiner,Rattle,Weka — Patrick Durusau @ 10:17 am

The Best Data Mining Tools You Can Use for Free in Your Company by Mawuna Remarque KOUTONIN.

Short descriptions of the usual suspects, plus a couple (jHepWork and PSPP) that were new to me.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

An interesting site in general.

Consider the following pitch for business success in Africa:

Africa: Your Business Should be Profitable in 45 days or Die

And the reasons for that claim:

1. “It’s almost virgin here. There are lots of opportunities, but you have to fight!”

2. “Target the vanity class with vanity products. The “new rich” have lots of money. They are tough on everything except their big ego and social reputation”

3. “Target the lazy executives and middle managers. Do the job they are paid for as a consultant. Be good, and politically savvy, and the money is yours”

4. “You’ll make more money in selling food or opening a restaurant than working for the Bank”

5. “You can’t avoid politics, but learn to think like the people you are talking with. Always finish your sentence with something like “the most important thing is the country’s development, not power. We all have to work in that direction”

6. “It’s about hard work and passion, but you should first forget about managing time like in Europe.

Take time to visit people, go to the vanity parties, have the patience to let stupid people finish their long empty sentences, and make the politicians understand that your project could make them win elections and strengthen their positions”

7. “Speed is everything. Think fast, Act fast, Be everywhere through friends, family and informants”

With the exception of #1, all of these points are advice I would give to someone marketing topic maps on any continent.

It may be easier to market topic maps where there are few legacy IT systems that might feel threatened by a new technology.

Agrifeeds…

Filed under: News — Patrick Durusau @ 5:44 am

Agrifeeds – What changes have been made and how

From the post:

The concept behind Agrifeeds has remained, in its core, the same. It harvests items from a collection of almost 200 feeds, in English, Spanish and French, and it offers the possibility of creating new feeds by filtering the aggregated content. The new version however has embellished and enriched the content being imported, thus enriching the items of the feeds it offers.

This has been accomplished with the use of both contributed modules and modules that have been written purposely for Agrifeeds. For the sake of brevity, those modules that are either well known or have been included mainly for visual purposes (e.g. the Calendar plugin for Views) will not be described in detail.

Despite my garden and a few backyard chickens, agriculture isn’t my specialty. 😉

I mention AIMS (Agricultural Information Management Service) as an example of interesting IT development outside my core reading area.

Curious how you would use topic maps with a continuous set of feeds?

…Cloud Integration is Becoming a Bigger Issue

Filed under: Cloud Computing,Data Integration,Marketing — Patrick Durusau @ 5:27 am

Survey Reports that Cloud Integration is Becoming a Bigger Issue by David Linthicum.

David cites a survey by KPMG that found thirty-three percent of executives complained of higher than expected costs for data integration in cloud projects.

One assumes those were the brighter thirty-three percent of those surveyed. The remainder apparently did not recognize the data integration issues in their cloud projects.

David writes:

Part of the problem is that data integration itself has never been sexy, and thus seems to be an issue that enterprise IT avoids until it can’t be ignored. However, data integration should be the life-force of the enterprise architecture, and there should be a solid strategy and foundational technology in place.

Cloud computing is not the cause of this problem, but it’s shining a much brighter light on the lack of data integration planning. Integrating cloud-based systems is a bit more complex and laborious. However, the data integration technology out there is well proven and supports cloud-based platforms as the source or the target in an integration chain. (emphasis added)

The more diverse data sources become, the larger data integration issues will loom.

Topic maps offer data integration efforts in cloud projects a choice:

1) You can integrate one-off, either with in-house or third-party tools, only to redo all that work with each new data source, or

2) You can integrate using a topic map (for integration or to document integration) and re-use the expertise from prior data integration efforts.

Suggest pitching topic maps as a value-add proposition.

Apache Lucene and Solr 4.2.1

Filed under: Lucene,Solr — Patrick Durusau @ 5:11 am

Bug fix releases for Apache Lucene and Solr.

Apache Lucene 4.2.1: Changes; Downloads.

Apache Solr 4.2.1: Changes; Downloads.

Apache cTAKES

Apache cTAKES

From the webpage:

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).

Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.

Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:

  1. Sentence boundary detection
  2. Tokenization (rule-based)
  3. Morphologic normalization
  4. POS tagging
  5. Shallow parsing
  6. Named Entity Recognition
    • Dictionary mapping
    • Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
  7. Assertion module
  8. Dependency parser
  9. Constituency parser
  10. Semantic Role Labeler
  11. Coreference resolver
  12. Relation extractor
  13. Drug Profile module
  14. Smoking status classifier

The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.

The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.

Processing a text with cTAKES is a process of adding semantic information to the text.

As you can imagine, the better the semantics that are added, the better searching and other functions become.

In order to make added semantic information interoperable, well, that’s a topic map question.
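To make that concrete, here is a rough sketch (not the actual UIMA type system, and the values are illustrative) of the attributes the description above lists for a named entity:

```python
from dataclasses import dataclass

@dataclass
class ClinicalMention:
    begin: int              # text span
    end: int
    text: str
    ontology_code: str      # e.g. a UMLS concept identifier
    semantic_type: str      # disease/disorder, sign/symptom, anatomy, ...
    subject: str = "patient"
    negated: bool = False
    uncertain: bool = False

# placeholder ontology code, purely for illustration
m = ClinicalMention(42, 56, "chest pain", "C0000000", "sign/symptom", negated=True)
# Merging such annotations across systems comes down to agreeing on what
# identifies the entity -- here the ontology code -- which is the topic map question.
```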

I first saw this in a tweet by Tim O’Reilly.

April 9, 2013

Improving Twitter search with real-time human computation [“semantics supplied”]

Filed under: Human Computation,Search Engines,Searching,Semantics,Tweets — Patrick Durusau @ 1:54 pm

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….

The post is quite awesome and I suggest you read it in full.
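As a toy illustration of the flow Edwin describes (the real thing is a Storm topology talking to Mechanical Turk through a Thrift API), you can count queries in a sliding window and hand spiking ones to a human judgment step whose answer is cached for the ranking side:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 600
SPIKE_THRESHOLD = 50

recent = defaultdict(deque)      # query -> timestamps of recent searches
judgments = {}                   # query -> labels supplied by a human judge

def ask_human_judge(query):
    # stand-in for dispatching to Mechanical Turk and polling for a response
    return {"category": "unknown", "newsworthy": True}

def observe(query, now=None):
    now = now or time.time()
    q = recent[query]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()                                      # keep only the window
    if len(q) >= SPIKE_THRESHOLD and query not in judgments:
        judgments[query] = ask_human_judge(query)        # dispatch once per spike

for _ in range(SPIKE_THRESHOLD):
    observe("big bird")
print(judgments["big bird"])
```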

This resonates with a recent comment about Lotus Agenda.

The short version is that a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplies semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between asking an expert (you get his/her semantics) and asking all of us (too many answers to be useful).

More to follow.

Springer Book Archives [Proposal for Access]

Filed under: Archives,Books — Patrick Durusau @ 11:16 am

The Springer Book Archives now contain 72,000 titles

From the post:

Today at the British UKSG Conference in Bournemouth, Springer announced that the Springer Book Archives (SBA) now contain 72,000 eBooks. This news represents the latest developments in a project that seeks to digitize nearly every Springer book ever published, dating back to 1842 when the publishing company was founded. The titles are being digitized and made available again for the scientific community through SpringerLink (link.springer.com), Springer’s online platform.

By the end of 2013 an unprecedented collection of around 100,000 historic, scholarly eBooks, in both English and German, will be available through the SBA. Researchers, students and librarians will be able to access the full text of these books free of any digital rights management. Springer also offers a print-on-demand option for most of the books.

Notable authors whose works Springer has published include high-level researchers and Nobel laureates, such as Werner von Siemens, Rudolf Diesel, Emil Fischer and Marie Curie. Their publications will be a valuable addition to this historic online archive.

SBA section at Springer: http://www.springer.com/bookarchives

A truly remarkable achievement but access will remain problematic for a number of potential users.

I would like to see the United States government purchase (as in pay an annual fee) unlimited access to SpringerLink for any U.S. based IP address.

Springer gets more revenue than it does now from U.S. domains, its licensing costs are reduced, all colleges and universities benefit, and everyone in the U.S. gains access to first-rate technical publications.

Not to mention that Springer gets the revenue from selling the print-on-demand paperback editions.

Seems like a no-brainer if you are looking to jump start a knowledge economy.

PS: Forward this to your Senator/Representative. Could be a viable model to satisfy the needs of publishers and readers.

I first saw this at: Springer Book Archives Now Contain 72,000 eBooks by Africa S. Hands.

Open Access Theses and Dissertations

Filed under: Theses/Dissertations,Topic Maps — Patrick Durusau @ 11:01 am

Open Access Theses and Dissertations

From the webpage:

OATD aims to be the best possible resource for finding open access graduate theses and dissertations published around the world. Metadata (information about the theses) comes from over 600 colleges, universities, and research institutions. OATD currently indexes over 1.5 million theses and dissertations.

A search for “topic maps” as a phrase turns up thirty-six matches.

Try it for yourself: OATD search: “topic maps”.

Yes, the total for RDF is substantially higher.

But I take that as an incentive to do a better job spreading the word about topic maps in academic circles.

Enjoy!

I first saw this at: Theses and Dissertations Available Through New Open Access Tool by Africa S. Hands.

Spring Cleaning Data: 1 of 6… [Federal Reserve]

Filed under: Government,Government Data,R — Patrick Durusau @ 10:46 am

Spring Cleaning Data: 1 of 6 – Downloading the Data & Opening Excel Files

From the post:

With spring in the air, I thought it would be fun to do a series on (spring) cleaning data. The posts will follow my efforts to download the data, import it into R, clean it up, merge the different files, add columns of created information, and then export a master file. During the process I will be offering at times different ways to do things; this is an attempt to show that there is no one way of doing something, but there are several. When appropriate I will demonstrate as many as I can think of, given the data.

This series of posts will be focusing on the Discount Window of the Federal Reserve. I know I seem to be picking on the Feds, but I am genuinely interested in what they have. The fact that there is data on the discount window at all, to be blunt, took legislation from Congress to get. The first step in this project was to find the data. The data and additional information can be downloaded here.

I don’t have much faith in government data but if you are going to debate on the “data,” such as it is, you will need to clean it up and combine it with other data.

This is a good start in that direction for data from the Federal Reserve.
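The series works in R; for the Python-inclined, the same first step is a couple of lines with pandas (the file name below is illustrative, use whatever you downloaded from the discount window site):

```python
import pandas as pd

df = pd.read_excel("discount_window_2010.xls")   # needs xlrd/openpyxl installed
print(df.shape)
print(df.head())        # eyeball the columns before any cleaning or merging
```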

If you are interested in data from other government agencies, publishing the steps needed to clean/combine their data would move everyone forward.

A topic map of cleaning directions for government data could be a useful tool.

Not that clean data = government transparency but it might make it easier to spot the shadows.

Astrophysical data mining with GPU…

Filed under: Astroinformatics,BigData,Data Mining,Genetic Algorithms,GPU — Patrick Durusau @ 10:02 am

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescape’, Giuseppe Longo, Giorgio Ventre.

Abstract:

We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.

BTW, DAMEWARE (DAta Mining Web Application REsource) is at http://dame.dsf.unina.it/beta_info.html.

In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.
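If you have never looked inside a genetic algorithm, here is a minimal sketch (my toy, not GAME). The point to notice is that the fitness evaluation is applied to every individual independently, which is exactly the step a GPU can run in parallel for the reported speedup:

```python
import numpy as np

def evolve(fitness, pop_size=256, genome_len=16, generations=50, rng=None):
    """Minimal real-coded genetic algorithm: select, cross over, mutate."""
    rng = rng or np.random.default_rng(0)
    pop = rng.random((pop_size, genome_len))
    for _ in range(generations):
        scores = fitness(pop)                                # embarrassingly parallel step
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fitter half
        # crossover: pair parents with a shuffled copy, mix genes at a random cut
        shuffled = parents[rng.permutation(len(parents))]
        cuts = rng.integers(1, genome_len, size=len(parents))
        children = np.where(np.arange(genome_len) < cuts[:, None], parents, shuffled)
        children += rng.normal(0, 0.02, children.shape)      # mutation
        pop = np.vstack([parents, children])
    return pop[np.argmax(fitness(pop))]

# toy usage: evolve toward a target vector
target = np.linspace(0, 1, 16)
best = evolve(lambda pop: -np.abs(pop - target).sum(axis=1))
print(np.round(best, 2))
```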

Makes me curious about the “noise” in other communications.

The “signal” is fairly easy to identify in astronomy, but what about in text or speech?

I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?

Or noise in a conversation that is clearly audible?

If we have 100% signal, how do we explain failing to understand a message in speech or writing?

If it is not “noise,” then what is the problem?

Homology Theory — A Primer

Filed under: Homology,Mathematics — Patrick Durusau @ 9:48 am

Homology Theory — A Primer by Jeremy Kun.

From the post:

This series on topology has been long and hard, but we are quickly approaching the topics where we can actually write programs. For this and the next post on homology, the most important background we will need is a solid foundation in linear algebra, specifically in row-reducing matrices (and the interpretation of row-reduction as a change of basis of a linear operator).

Last time we engaged in a whirlwind tour of the fundamental group and homotopy theory. And we mean “whirlwind” as it sounds; it was all over the place in terms of organization. The most important fact that one should take away from that discussion is the idea that we can compute, algebraically, some qualitative features about a topological space related to “n-dimensional holes.” For one-dimensional things, a hole would look like a circle, and for two dimensional things, it would look like a hollow sphere, etc. More importantly, we saw that this algebraic data, which we called the fundamental group, is a topological invariant. That is, if two topological spaces have different fundamental groups, then they are “fundamentally” different under the topological lens (they are not homeomorphic, and not even homotopy equivalent).

Unfortunately the main difficulty of homotopy theory (and part of what makes it so interesting) is that these “holes” interact with each other in elusive and convoluted ways, and the algebra reflects it almost too well. Part of the problem with the fundamental group is that it deftly eludes our domain of interest: we don’t know a general method to compute the damn things!

Jeremy continues his series on topology and promises programs are not far ahead!
