Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

September 6, 2011

SmartData Collective

Filed under: Business Intelligence,Data Management — Patrick Durusau @ 7:16 pm

SmartData Collective

From the about page:

SmartData Collective, an online community moderated by Social Media Today, provides enterprise leaders access to the latest trends in Business Intelligence and Data Management. Our innovative model serves as a platform for recognized, global experts to share their insights through peer contributions, custom content publishing and alignment with industry leaders. SmartData Collective is a key resource for executives who need to make informed data management decisions.

Maybe a bit more mainstream than what you are accustomed to, but think of it as a cross-cultural experience. 😉

Seriously, effective promotion of topic maps means pitching them as solving problems as seen by others, not ourselves.

Hadoop Fatigue — Alternatives to Hadoop

Filed under: GraphLab,Hadoop,HPCC,MapReduce,Spark,Storm — Patrick Durusau @ 7:15 pm

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or formulate why you choose Hadoop over those alternatives?

From the post:

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky enough to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep them running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:
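
To make the counter complaint concrete by contrast, here is a minimal word-count sketch in Python using mrjob (my choice of framework for illustration, nothing the post endorses); bumping a counter is a single method call rather than a class of its own.

    # Minimal word-count job with a counter, using mrjob (illustrative choice).
    from mrjob.job import MRJob


    class MRWordCount(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                # One call is enough to track a custom counter.
                self.increment_counter("stats", "words_seen")
                yield word.lower(), 1

        def reducer(self, word, counts):
            yield word, sum(counts)


    if __name__ == "__main__":
        MRWordCount.run()

Run locally with python wordcount.py input.txt; the same script can be pointed at a Hadoop cluster, so the comparison is with the Java API rather than with Hadoop itself.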

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Preview of Storm: The Hadoop of Realtime Processing.

It is a pet peeve of mine that some authors force me to search for links they could have just as well entered. The New York Times, of all places, refers to websites and does not include the URLs. And that is for paid subscribers.

SIGKDD 2011 Conference

A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.

Day 1 (Graph Mining and David Blei/Topic Models)

Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.

Days 2/3/4 Summary

Useful summaries and pointers to many additional resources.

If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?

I added a category for SIGKDD.

Electronic Statistics Textbook

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:02 pm

Electronic Statistics Textbook

From the website:

The only Internet Resource about Statistics Recommended by Encyclopedia Britannica

StatSoft has freely provided the Electronic Statistics Textbook as a public service for more than 12 years now.

This Textbook offers training in the understanding and application of statistics. The material was developed at the StatSoft R&D department based on many years of teaching undergraduate and graduate statistics courses and covers a wide variety of applications, including laboratory research (biomedical, agricultural, etc.), business statistics, credit scoring, forecasting, social science statistics and survey research, data mining, engineering and quality control applications, and many others.

The Electronic Textbook begins with an overview of the relevant elementary (pivotal) concepts and continues with a more in depth exploration of specific areas of statistics, organized by “modules” and accessible by buttons, representing classes of analytic techniques. A glossary of statistical terms and a list of references for further study are included.

Proper citation
(Electronic Version): StatSoft, Inc. (2011). Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB: http://www.statsoft.com/textbook/. (Printed Version): Hill, T. & Lewicki, P. (2007). STATISTICS: Methods and Applications. StatSoft, Tulsa, OK.

This is going to get a bookmark for sure!

Sage Bionetworks Synapse Project – Webinar – Weds. 7 Sept. 2011

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:02 pm

Sage Bionetworks Synapse Project – Webinar – Weds. 7 Sept. 2011

Call-in Details:

——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=107799137&UID=0&PW=NNjE3OWYzODk3&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=107799137&tollFree=0

Access code: 926 719 478

Abstract:

The recent exponential growth of biological “omics” data has occurred concurrently with a decline in the number of New Molecular Entities approved by the FDA, proving that biological research productivity does not scale with biological data generation and the analysis and interpretation of genomic data is a bottleneck in the development of new treatments. Sage Bionetworks’ mission is to catalyze a cultural transition from the traditional single lab, single-company, and single-therapy R&D paradigm to a model with broad precompetitive collaboration on the analysis of large scale data in medical sciences. Part of Sage’s solution is Synapse, a platform for open, reproducible data-driven science, which will support the reusability of information facilitated by ontology-based services and applications directed at scientific researchers and data curators. Sage Bionetworks is actively pursuing the acquisition, curation, statistical quality control, and hosting of datasets that integrate both clinical phenotype and genomic data along with an intermediate molecular layer such as gene expression or proteomic data. We expect hosting these sorts of unique, integrative, high value datasets in the public domain on Synapse will seed a variety of analytical approaches to drive new treatments based on better understanding of disease states and the biological effects of existing drugs. In this webinar, Dr. Michael Kellen, Director of Technology at Sage Bionetworks will provide a demonstration of an alpha version of the Synapse platform, and discuss its application to clinical science.

Interesting claim about the decline in the number of New Molecular Entities (NMEs) approved by the FDA; see: NMEs approved by CDER. Approvals are, on average, about the same. But then, applications for NMEs have to be filed in order to be approved.

Just for background reading, you might want to look at: New Chemical Entity over at Wikipedia.

Or, The Scope of New Chemical Entity Exclusivity and FDA’s “Umbrella” Exclusivity Policy

I don’t disagree that better data analysis tools are needed but remain puzzled as to what the FDA approval rate for NMEs has to do with the problem.

Improving Entity Resolution with Global Constraints

Filed under: Data Integration,Data Mining,Entity Resolution — Patrick Durusau @ 7:00 pm

Improving Entity Resolution with Global Constraints by Jim Gemmell, Benjamin I. P. Rubinstein, and Ashok K. Chandra.

Abstract:

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice, turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.

Relies on entity resolution that has been performed in another context. I rather like that, as opposed to starting at ground zero.

I was amused that “adult titles” were excluded from the data set. I don’t have the numbers right off hand but “adult titles” account for a large percentage of movie income. Not unlike using stock market data but excluding all finance industry stocks. Seems incomplete.
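
To see why max weight matching is the natural baseline the authors start from, here is a toy sketch (mine, not the paper's system): score title pairs across two sites with a cheap string similarity and take a maximum-weight matching, so each entity on one site pairs with at most one on the other.

    # Toy cross-site entity matching via maximum-weight matching; illustrative only.
    from difflib import SequenceMatcher
    import networkx as nx

    site_a = ["The Matrix (1999)", "Heat", "Alien"]
    site_b = ["Matrix, The", "Heat (1995)", "Aliens"]

    G = nx.Graph()
    for a in site_a:
        for b in site_b:
            # Cheap string similarity stands in for a real comparison model.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            G.add_edge(("A", a), ("B", b), weight=score)

    # One-to-one pairing across the two sites, maximizing total similarity.
    for u, v in nx.max_weight_matching(G):
        left, right = sorted([u, v])
        print(f"{left[1]!r}  <->  {right[1]!r}")

The paper's point is that the no-gratuitous-duplicates property behind that one-to-one constraint is a real signal, while also showing cases where plain max weight matching performs poorly, so treat this strictly as the baseline being critiqued.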

Berlin Buzzwords 2011 – Slides/Videos

Filed under: Indexing,NoSQL — Patrick Durusau @ 6:59 pm

Berlin Buzzwords 2011 – Slides/Videos

I listed the slides and presentations together and sorted the listing by author. A number of very good presentations.

BTW, congratulations to the organizers of Berlin Buzzwords! Truly awesome gathering of talent.

I created this listing to assist myself in mining the presentations. Please forward any corrections. Enjoy!

September 5, 2011

Palin, Bachmann, and the Internal Welfare Code (aka, Internal Revenue Code)

Filed under: Government Data,Marketing,Topic Maps — Patrick Durusau @ 8:02 pm

Sarah Palin and Rep. Michelle Bachmann (R-Minnesota) support a 0% corporate tax rate and closing corporate loopholes in the Internal Revenue Code.*

Those cheering are more interested in the 0% corporate tax rate than closing corporate loopholes.

Truth be told, it should be called the Internal Welfare Code (IWC) as most of its provisions are loopholes for one group or another.

That makes tax reform hard because it is welfare reform. To have reform, someone has to give up their welfare benefits.

When welfare/tax provisions are written into the IWC/IRC, reports are prepared on the cost in revenue for those provisions. It often is easy to see who benefits from them.

Now there is a topic map project. Mapping the provisions of the IWC/IRC to the reports on “cost in revenue” for those provisions and identifying those who benefit from them. From that mapping you could produce a color-coded IWC/IRC that has the loopholes/provisions for each group identified by color. Or even re-organize the IWC/IRC by color so the loopholes for each group can be roughly compared.
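
A hedged sketch of the starting data structure for such a mapping, with made-up section numbers, cost figures and beneficiary groups:

    # Made-up example data: code provision -> (estimated revenue cost, beneficiary group).
    provisions = {
        "Sec. 101(k)": {"cost_estimate": 1.2e9, "beneficiary": "energy producers"},
        "Sec. 243(b)": {"cost_estimate": 4.5e8, "beneficiary": "financial services"},
        "Sec. 415(x)": {"cost_estimate": 2.0e8, "beneficiary": "insurance carriers"},
    }

    # Group ("color-code") provisions by who benefits, as suggested above.
    by_group = {}
    for section, info in provisions.items():
        by_group.setdefault(info["beneficiary"], []).append((section, info["cost_estimate"]))

    for group, items in sorted(by_group.items()):
        total = sum(cost for _, cost in items)
        print(f"{group}: {len(items)} provision(s), ${total:,.0f} estimated revenue cost")

The real work, of course, is the mapping itself: tying each provision to the revenue reports and to the groups that benefit from it.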

That would be government transparency with bite!

PS: If you know of any government transparency project that would be interested, please pass this along. Or any candidate for that matter.

*The logic of closing corporate loopholes alongside a 0% tax rate escapes me. But then, I am not running for President of the United States.

< NIEM > National Information Exchange Model

< NIEM > National Information Exchange Model

From the technical introduction:

NIEM provides a common vocabulary for consistent, repeatable exchanges of information between agencies and domains. The model is represented in a number of forms, including a data dictionary and a reference schema, and includes the body of concepts and rules that underlie its structure, maintain its consistency, and govern its use.

NIEM is a comprehensive resource for organizations to successfully exchange information, offering tools, terminology, help, training, governance, and an active community of users.

NIEM uses extensible markup language (XML), which allows the structure and meaning of data to be defined through simple but carefully defined syntax rules and provides a common framework for information exchange.

The model’s unique architecture enables data components to be constrained, extended, and augmented as necessary to formulate XML exchange schemas, and XML instance documents defining the information payloads for data exchange. These exchange-defining documents are packaged in information exchange package documentation (IEPDs) that are reusable, modifiable, and extendable.
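
Strip away the acronyms and the quoted description is a familiar workflow: a reference/exchange schema constrains XML instance documents. A minimal sketch of that validation step with lxml; the file names are hypothetical placeholders, not actual NIEM artifacts.

    # Validate an XML instance document (information payload) against an exchange schema.
    # File names are hypothetical placeholders, not real NIEM/IEPD artifacts.
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("exchange-schema.xsd"))
    instance = etree.parse("payload-instance.xml")

    if schema.validate(instance):
        print("Instance conforms to the exchange schema.")
    else:
        for error in schema.error_log:
            print(error.message)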

It’s Labor Day and I have yet to get the “tools” link to work. Must be load on the site. 😉

It’s a large effort and site so it will take some time to explore it.

If you are participating in < NIEM > please give a shout.

PS: I encountered < NIEM > following a link to the 2011 National Training Event videos. Registration is required but free.

Bennett Launches Site on citeproc-js Legal Citation Features

Filed under: Legal Informatics — Patrick Durusau @ 7:35 pm

Bennett Launches Site on citeproc-js Legal Citation Features

From the post:

Professor Frank Bennett of the Nagoya University Graduate School of Law has launched CitationStylist, a new Website that provides information and tools related to the legal citation “features of the citeproc-js citation formatter.”

The CitationStylist site styles are based on “Bluebook: A Uniform System of Citation (Columbia Law Review Ass’n et al. eds., 19th ed. 2010),” the styles used in the United States and the courts therein.

US-based legal topic maps will be using those styles for presentation of legal citations, so this will be a very valuable tool.

Does anyone know of an equivalent tool for non-US citations?

Sartor et al. on Legislative XML for the Semantic Web

Filed under: Government Data,Legal Informatics — Patrick Durusau @ 7:34 pm

Sartor et al. on Legislative XML for the Semantic Web from the Legalinformatics Blog.

Legislative XML for the Semantic Web: Principles, Models, Standards for Document Management (Springer 2011), a collection of scholarly articles on the use of XML and Semantic Web technologies in connection with legislative information systems, has been published.

Should be of interest for anyone working on topic maps and legislative information systems.

How Hard is the Local Search Problem?

Filed under: Geographic Information Retrieval,Local Search,Mapping,Searching — Patrick Durusau @ 7:33 pm

How Hard is the Local Search Problem? by Matthew Hurst.

The “local search” problem that Matthew is addressing is illustrated with Google’s mapping of local restaurants in Matthew’s neighborhood.

The post starts:

The local search problem has two key components: data curation (creating and maintaining a set of high quality statements about what the world looks like) and relevance (returning those statements in a manner that satisfies a user need). The first part of the problem is a key enabler to success, but how hard is it?

There are many problems which involve bringing together various data sources (which might be automatically or manually created) and synthesizing an improved set of statements intended to denote something about the real world. The way in which we judge the results of such a process is to take the final database, sample it, and test it against what the world looks like.

In the local search space, this might mean testing to see if the phone number in a local listing is indeed that associated with a business of the given name and at the given location.

But how do we quantify this challenge? We might perform the above evaluation and find out that 98% of the phone numbers are correctly associated. Is that good? Expected? Poor?
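
One way to start answering "is 98% good?" is to remember it is a sample estimate. A quick sketch (mine, not Matthew's) of a normal-approximation confidence interval for a sampled accuracy figure:

    # 95% confidence interval for an accuracy estimate from a random sample of listings.
    import math

    def accuracy_interval(correct, sampled, z=1.96):
        """Normal-approximation confidence interval for the true proportion correct."""
        p = correct / sampled
        half_width = z * math.sqrt(p * (1 - p) / sampled)
        return p - half_width, p + half_width

    # E.g. 490 of 500 sampled phone numbers verified as correct.
    low, high = accuracy_interval(490, 500)
    print(f"Estimated accuracy 98%, 95% CI roughly {low:.1%} to {high:.1%}")

Even then, "good" depends on how fast the world changes underneath the listings (businesses move, numbers get reassigned), which is the harder part of the question.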

After following Matthew through his discussion of the various factors in “local search,” what are your thoughts on Google’s success with “local search?”

Could you do better?

How? Be specific, a worked example would be even more convincing.

Erlang – 3 Slide decks

Filed under: Actor-Based,Erlang — Patrick Durusau @ 7:28 pm

I encountered three (3) slide decks on Erlang today:

Mohamed Samy presents two sessions on Erlang:

Erlang Session 1 – General introduction, sequential Erlang.

Erlang Session 2 – Concurrency, Actors

Despite the titles, there was no session 3.

While writing those up, I saw:

Concurrency Oriented Programming in Erlang – A more advanced view of Erlang and its possibilities.

Ontopia now supports numbers in tolog

Filed under: Ontopia,tolog — Patrick Durusau @ 7:28 pm

Ontopia now supports numbers in tolog

Peter-Paul Kruijsen announces support for numbers in tolog:

Over the past years Ontopia was not able to work with numbers. Sorting e.g. a list of occurrence values could result in ‘123’, ’45’, ‘6’, ’78’. For one of our customers, we needed an implementation that would sort these as ‘6’, ’45’, ’78’, ‘123’, as well as an implementation for adding, subtracting, multiplying and dividing numbers.

I have just committed NumbersModule to Ontopia [1]. It will be available in the upcoming 5.2.0 release. With this addition, tolog queries can now work with numbers. It supports these predicates:

  • value(string, result): Parses a string (e.g. an occurrence value) into a number. Pattern and locale are optional, e.g. to parse ‘€ 5.026,34’ for us Europeans.
  • format(number, result): Formats a number into a string. Pattern and locale are optional, e.g. to format a number as ‘42.6 %’.
  • absolute(result, number): Calculates the absolute value of a number.
  • add(result, number, number): Adds numbers. Providing more than 2 input values is allowed.
  • subtract(result, number, number): Subtracts numbers. Providing more than 2 input values is allowed.
  • multiply(result, number, number): Multiplies numbers. Providing more than 2 input values is allowed.
  • divide(result, number, number): Divides numbers. Providing more than 2 input values is allowed.
  • min(result, number, number): Calculates the minimum value. Providing more than 2 input values is allowed.
  • max(result, number, number): Calculates the maximum value. Providing more than 2 input values is allowed.

….

[1]: http://code.google.com/p/ontopia/source/detail?r=2182
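
The sorting problem Peter-Paul describes is easy to reproduce outside Ontopia. A quick Python illustration of lexicographic versus numeric ordering (not tolog or Ontopia code, just the underlying issue):

    # String sort vs. numeric sort: the behaviour the NumbersModule addresses.
    values = ["123", "45", "6", "78"]

    print(sorted(values))            # lexicographic: ['123', '45', '6', '78']
    print(sorted(values, key=int))   # numeric:       ['6', '45', '78', '123']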

See the post for full details.

Thanks Peter-Paul!

September 4, 2011

Semantic Integration in the IFF

Filed under: Category Theory,Ontology — Patrick Durusau @ 7:20 pm

Semantic Integration in the IFF by Robert E. Kent

Abstract:

The IEEE P1600.1 Standard Upper Ontology (SUO) project aims to specify an upper ontology that will provide a structure and a set of general concepts upon which domain ontologies could be constructed. The Information Flow Framework (IFF), which is being developed under the auspices of the SUO Working Group, represents the structural aspect of the SUO. The IFF is based on category theory. Semantic integration of object-level ontologies in the IFF is represented with its fusion construction. The IFF maintains ontologies using powerful composition primitives, which includes the fusion construction.

Comments: Presented at the Semantic Integration Workshop of the 2nd International Semantic Web Conference (ISWC2003), Sanibel Island, Florida, October 20, 2003.

IFF = Information Flow Framework. From, Barwise, Jon and Jerry Seligman. Information Flow: The Logic of Distributed Systems. Cambridge Tracts in Theoretical Computer Science 44. Cambridge University Press. 1997.

Historical document at this point but interesting nonetheless. Describes a category theory view of semantic integration.

Visualizing Bayes’ Theorem

Filed under: Bayesian Models — Patrick Durusau @ 7:14 pm

Visualizing Bayes’ Theorem by Oscar Bonilla.

Uses Venn diagrams to construct a visual derivation of Bayes’ theorem.
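
For reference, the identity the Venn diagrams build toward, starting from the two ways of writing the joint probability:

    P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
    \quad\Longrightarrow\quad
    P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}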

September 3, 2011

Topic Map Opportunity: Financial Crimes

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:48 pm

As in investigating them.*

The need for identity resolution is alive and well. Here is an example of one market for the results that topic maps deliver.

Investigating Financial Crimes: Looking for Parts of Needles Over Multiple Haystacks?

From the webpage:

The International Association of Financial Crimes Investigators (IAFCI) annual conference begins next week in Charlotte, NC. The association, a non-profit international organization, provides an environment within which information about financial fraud, fraud investigation and fraud prevention methods can be collected, exchanged and taught for the common good of the industry. Infoglide Software Corporation is a proud sponsor of IAFCI and will be attending this year’s event. We invite all of our friends – and future customers – to come visit us at Booth 105. We would love to see you there. The conference begins on Monday, August 29th and runs through Thursday, September 2nd at the Charlotte Convention Center.

IAFCI has members across the world in every major continent, broken down by about one third law enforcement, one third banking and one third retail and service members. The membership dovetails nicely with Infoglide’s customer base. With a presence in major retail organizations, top global banks and mission critical government agencies, it is evident that Infoglide’s Identity Resolution Engine (IRE) is a tool that financial crimes investigators are excited about.

If you’re in the business of detecting and investigating financial crimes, AML and fraud, you know what it’s like to perform endless searches into disparate data sources looking for that golden nugget of information. It’s worse than trying to find a needle in a haystack. In fact, the needle itself is usually spread across several haystacks. Fortunately, Infoglide’s patented IRE software helps financial crimes investigators quickly identify ‘persons of interest’ within those haystacks of data. Here’s how:

  1. Enterprise-wide Identity Resolution: allows single-request searching into multiple databases without the need to move or clean the data. Accounting for variations in names, addresses and other attributes, it eliminates time and effort in triaging fraud cases, and allows analysts to focus on the high-return cases.
  2. Social Link Discovery: looks at non-obvious relationships between individuals across databases. By understanding, for example, that a loan applicant shares an address with the loans officer, and also shares a telephone number with a known fraudster, a company can gain immediate insight into the risks associated with that transaction.
  3. Anonymous Resolution for data Privacy: allows organizations to productively search into restricted databases without violating international data privacy laws. The analyst can understand if a match was ‘likely’ found in the restricted data, without ever seeing or retrieving the actual results.
  4. Real Time Red Flag Analysis: is the proactive implementation of the technology that looks at incoming transactions and compares them to internal and third party databases to understand possible identity matches and non-obvious relationships. If one is found, the software triggers an instant alert.

So, if you or someone you know is heading to the conference, please stop by to meet us. It’s possible those haystacks aren’t quite as intimidating as you thought.

What needles in haystacks are you finding?
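
As a toy illustration of the "social link discovery" idea above (mine, not Infoglide's code): index records by shared attributes and flag identities that meet through them.

    # Toy "social link discovery": find records connected by shared attributes.
    from collections import defaultdict

    records = [
        {"id": "applicant-17", "name": "J. Smith", "address": "12 Oak St", "phone": "555-0101"},
        {"id": "loan-officer-3", "name": "Jane Doe", "address": "12 Oak St", "phone": "555-0199"},
        {"id": "known-fraudster-8", "name": "Jon Smith", "address": "9 Elm Ave", "phone": "555-0101"},
    ]

    index = defaultdict(list)
    for rec in records:
        for attr in ("address", "phone"):
            index[(attr, rec[attr])].append(rec["id"])

    for (attr, value), ids in index.items():
        if len(ids) > 1:
            print(f"{' and '.join(ids)} share {attr} {value!r}")

The hard parts such products sell (name variation, scale, and searching data you are not allowed to see) are exactly what this sketch leaves out.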


* That and other uses, inquire.

Collaborating with Selfish People

Filed under: Collaboration — Patrick Durusau @ 6:47 pm

Topic maps don’t require collaboration to be authored or maintained, but unless your client has an unlimited budget, collaboration is one way to extend the reach and utility of your topic map.

The question is how to engender cooperation in an environment populated by selfish users? (US intelligence services being a good example.)

I ran across a grant summary by Jared Saia of the University of New Mexico:

Beyond Tit-for-Tat: New Techniques for Collaboration in Network Security Games

which reads in part:

Motivation and Problem: How can we ensure collaboration on the Internet, where populations are highly fluctuating, selfish, and unpredictable? We propose a new algorithmic technique for enabling collaboration in network security games. Our technique, Secure Multiparty Mediation (SMM), improves on past approaches such as tit-for-tat in the following ways: (1) it works even in single round games; (2) it works even when the actions of players are never revealed; (3) it works even in the presence of churn, i.e. players joining and leaving the game.

It impressed the NSF: Award Abstract #1017509.

Then I found:

Scalable Mechanisms for Rational Secret Sharing.

You probably want to watch Jared Saia’s homepage and publications.

An attempt to create a solution that doesn’t involve changing human nature. The latter being remarkably resistant to change. Just ask the Catholic Church.

Decision Support for e-Governance: A Text Mining Approach

Filed under: Data Mining,eGov,Text Extraction — Patrick Durusau @ 6:47 pm

Decision Support for e-Governance: A Text Mining Approach by G.Koteswara Rao, and Shubhamoy Dey.

Abstract:

Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens’ petitions and stakeholders’ views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens’ opinions expressed in electronic public forums and blogs etc. We also present here, an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

The principles of subject identity could usefully inform many aspects of this “project.” I hesitate to use the word “project” for an effort that will eventually involve twenty-two (22) official languages, several scripts and governance of several hundred million people.

A good starting point for learning about the issues facing e-Governance in India.

Redis for processing payments

Filed under: NoSQL,Redis — Patrick Durusau @ 6:46 pm

Redis for processing payments

Not a complete payment or even work-flow system but enough to make you think about how to use Redis in such a situation.
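
A minimal sketch of the kind of thing the post gestures at, using redis-py (queue names and job fields are my own, purely illustrative):

    # Minimal reliable-queue sketch for payment jobs using redis-py; illustrative only.
    import json
    import redis

    r = redis.Redis()

    # Producer: enqueue a payment job.
    r.lpush("payments:pending", json.dumps({"order_id": "A-1001", "amount_cents": 4200}))

    # Consumer: atomically move the job to an in-flight list, process, then acknowledge.
    raw = r.brpoplpush("payments:pending", "payments:processing", timeout=5)
    if raw:
        job = json.loads(raw)
        print("charging order", job["order_id"])   # real charging logic would go here
        r.lrem("payments:processing", 1, raw)      # remove only after success

The in-flight list is what makes it interesting: a crashed worker leaves its job visible for recovery instead of losing it.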

Schema VOAG

Filed under: Attribution,Governance,Ontology — Patrick Durusau @ 6:46 pm

Schema VOAG

From the website:

VOAG stands for “Vocabulary Of Attribution and Governance”. The ontology is intended to specify licensing, attribution, provenance and governance of an ontology. VOAG captures many common license types and their restrictions. Where a license requires attribution, VOAG provides resources that allow the attribution to be made. Provenance is defined in terms of source and pedigree. A minimal model of governance is provided based on how issues, releases and changes are managed. VOAG does not import, but makes use of, some concepts from VOID (http://vocab.deri.ie/void), notably void:Dataset.

DiscoverText

Filed under: Data Analysis,Data Mining,DiscoverText — Patrick Durusau @ 6:46 pm

DiscoverText

From the website:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.

Interesting tool set, based in the cloud.

PCAT

Filed under: Data Analysis,Data Mining,PCAT — Patrick Durusau @ 6:45 pm

PCAT – Public Comment Analysis Toolkit

A cloud based analysis service.

PCAT can import:

Federal Docket Management System Archives
Email, Blog and Wiki Content
Plain text, HTML, or XML Documents
Microsoft Word and Adobe PDFs
Excel or CSV Spreadsheets
Archived RSS Feeds
CAT-style Datasets

PCAT capabilities:

Search for key concepts & code text
Remove duplicates & cluster similar comments
Form peer & project networks
Establish credentials & permissions
Assign multiple coders to tasks
Annotate coding with shared memos
Easily measure inter-coder reliability
Adjudicate valid & invalid coder decisions
Generate reports in RTF, CSV, PDF or XML format
Archive or share completed projects online

If you have used PCAT, please comment.

September 2, 2011

Improving the recall of decentralised linked data querying through implicit knowledge

Filed under: Linked Data,LOD,SPARQL — Patrick Durusau @ 8:02 pm

Improving the recall of decentralised linked data querying through implicit knowledge by Jürgen Umbrich, Aidan Hogan, and Axel Polleres.

Abstract:

Aside from crawling, indexing, and querying RDF data centrally, Linked Data principles allow for processing SPARQL queries on-the-fly by dereferencing URIs. Proposed link-traversal query approaches for Linked Data have the benefits of up-to-date results and decentralised (i.e., client-side) execution, but operate on incomplete knowledge available in dereferenced documents, thus affecting recall. In this paper, we investigate how implicit knowledge – specifically that found through owl:sameAs and RDFS reasoning – can improve the recall in this setting. We start with an empirical analysis of a large crawl featuring 4 m Linked Data sources and 1.1 g quadruples: we (1) measure expected recall by only considering dereferenceable information, (2) measure the improvement in recall given by considering rdfs:seeAlso links as previous proposals did. We further propose and measure the impact of additionally considering (3) owl:sameAs links, and (4) applying lightweight RDFS reasoning (specifically ρDF) for finding more results, relying on static schema information. We evaluate our methods for live queries over our crawl.

From the document:

owl:sameAs links are used to expand the set of query relevant sources, and owl:sameAs rules are used to materialise implicit knowledge given by the OWL semantics, potentially generating additional answers.

I have always thought that knowing the “why” behind an owl:sameAs assertion would make it more powerful. But since any basis for subject sameness can be used, that may not be the case.
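
The "considering owl:sameAs links" step can be sketched independently of the paper: compute the sameAs closure (a small union-find will do) and use the equivalence class to expand the set of URIs worth dereferencing. An illustration of the idea only, not the authors' code:

    # Tiny owl:sameAs closure via union-find; illustrative only.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    same_as = [
        ("http://dbpedia.org/resource/Berlin", "http://example.org/city/berlin"),
        ("http://example.org/city/berlin", "http://example.net/de/berlin"),
    ]
    for a, b in same_as:
        union(a, b)

    query_uri = "http://dbpedia.org/resource/Berlin"
    root = find(query_uri)
    print("Also dereference:", [u for u in parent if find(u) == root])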

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Summarization — Patrick Durusau @ 8:00 pm

Discovering, Summarizing and Using Multiple Clusterings

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings

Athens, Greece, September 5, 2011.

Where you will find:

Invited Talks

1. Combinatorial Approaches to Clustering and Feature Selection, Michael E. Houle

2. Cartification: Turning Similarities into Itemset Frequencies, Bart Goethals

Research Papers

3. When Pattern Met Subspace Cluster, Jilles Vreeken, Arthur Zimek

4. Fast Multidimensional Clustering of Categorical Data, Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu

5. Factorial Clustering with an Application to Plant Distribution Data, Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth

6. Subjectively Interesting Alternative Clusters,Tijl De Bie

7. Evaluation of Multiple Clustering Solutions, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek

8. Browsing Robust Clustering-Alternatives, Martin Hahmann, Dirk Habich, Wolfgang Lehner

9. Generating a Diverse Set of High-Quality Clusterings, Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

Discovering the Impact of Knowledge in Recommender Systems: A Comparative Study

Filed under: Recommendation,Semantics — Patrick Durusau @ 7:59 pm

Discovering the Impact of Knowledge in Recommender Systems: A Comparative Study by Bahram Amini, Roliana Ibrahim, and Mohd Shahizan Othman.

Abstract:

Recommender systems engage user profiles and appropriate filtering techniques to assist users in finding more relevant information over the large volume of information. User profiles play an important role in the success of the recommendation process since they model and represent the actual user needs. However, a comprehensive literature review of recommender systems has demonstrated no concrete study on the role and impact of knowledge in user profiling and filtering approaches. In this paper, we review the most prominent recommender systems in the literature and examine the impression of knowledge extracted from different sources. We then come up with this finding that semantic information from the user context has substantial impact on the performance of knowledge-based recommender systems. Finally, some new clues for improving the knowledge-based profiles have been proposed.

Interesting work but I am uncertain about the need to “extract” semantic information from users. At least directly. As in linguistics, it may be enough to see where the user falls statistically and use that as a guide to the semantics. As in linguistics, it will miss the edge cases but those are likely to be missed anyway.

Category-Based Routing in Social Networks:…

Filed under: Identity,Networks,Social Networks — Patrick Durusau @ 7:58 pm

Category-Based Routing in Social Networks: Membership Dimension and the Small-World Phenomenon (Short) by David Eppstein, Michael T. Goodrich, Maarten Löffler, Darren Strash, and Lowell Trott.

Abstract:

A classic experiment by Milgram shows that individuals can route messages along short paths in social networks, given only simple categorical information about recipients (such as “he is a prominent lawyer in Boston” or “she is a Freshman sociology major at Harvard”). That is, these networks have very short paths between pairs of nodes (the so-called small-world phenomenon); moreover, participants are able to route messages along these paths even though each person is only aware of a small part of the network topology. Some sociologists conjecture that participants in such scenarios use a greedy routing strategy in which they forward messages to acquaintances that have more categories in common with the recipient than they do, and similar strategies have recently been proposed for routing messages in dynamic ad-hoc networks of mobile devices. In this paper, we introduce a network property called membership dimension, which characterizes the cognitive load required to maintain relationships between participants and categories in a social network. We show that any connected network has a system of categories that will support greedy routing, but that these categories can be made to have small membership dimension if and only if the underlying network exhibits the small-world phenomenon.
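
The greedy strategy in the abstract is simple to write down: forward the message to the acquaintance who shares the most categories with the recipient. A toy sketch, with invented people and categories:

    # Greedy category-based routing: forward to the acquaintance sharing
    # the most categories with the recipient. Invented data, illustration only.
    categories = {
        "alice": {"boston", "lawyer"},
        "bob":   {"boston", "student"},
        "carol": {"lawyer", "golf"},
        "dave":  {"harvard", "student", "sociology"},
    }
    acquaintances = {
        "alice": ["bob", "carol"],
        "bob":   ["alice", "dave"],
        "carol": ["alice"],
        "dave":  ["bob"],
    }

    def greedy_route(start, recipient, max_hops=10):
        path, current = [start], start
        for _ in range(max_hops):
            if current == recipient:
                return path
            # Neighbour with the largest category overlap with the recipient.
            current = max(acquaintances[current],
                          key=lambda n: len(categories[n] & categories[recipient]))
            path.append(current)
        return path

    print(greedy_route("alice", "dave"))   # ['alice', 'bob', 'dave']

The paper's membership dimension is, roughly, how many category memberships a participant has to keep track of for this to keep working on a large network.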

So, if identity is a social construct and the result of small-world networks, then we may need a different kind of precision (different from scientific measurement) to identify subjects.

Perhaps the reverse of 20 questions: how many questions do we need for a particular subject? Does anyone remember if there was a common number of questions that were sufficient for the 20-questions game?

Welcome to The Matrix Factorization Jungle

Filed under: Data Mining,Matrix — Patrick Durusau @ 7:55 pm

Welcome to The Matrix Factorization Jungle [ A living document on the state of the art algorithms dedicated to matrix factorization ]

From the webpage:

Matrix Decompositions has a long history and generally centers around a set of known factorizations such as LU, QR, SVD and eigendecompositions. With the advent of new methods based on random projections and convex optimization that started in part in the compressive sensing literature, we are seeing a surge of very diverse algorithms dedicated to many different kinds of matrix factorizations with constraints based on rank, positivity, sparsity,… As a result of this large increase in interest, I have decided to keep a list of them here following the success of the big picture in compressive sensing.

If you are unfamiliar with the use of matrices in data mining, consider Non-negative matrix factorization and the examples cited under Text mining.
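
Since the post points to non-negative matrix factorization for text mining, here is a minimal sketch with scikit-learn on a toy term-document matrix; the data and parameters are illustrative only.

    # Toy NMF "topic" factorization of a tiny document collection; illustrative only.
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "hadoop cluster mapreduce job",
        "mapreduce job scheduling on hadoop",
        "bayes theorem probability prior",
        "prior probability and bayes rule",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)               # document-term counts
    model = NMF(n_components=2, init="nndsvd", random_state=0)
    W = model.fit_transform(X)                        # document-topic weights
    H = model.components_                             # topic-term weights

    terms = vectorizer.get_feature_names_out()
    for topic_idx, weights in enumerate(H):
        top = [terms[i] for i in weights.argsort()[::-1][:3]]
        print(f"topic {topic_idx}: {', '.join(top)}")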

Groonga

Filed under: Column-Oriented,NoSQL,Search Engines,Searching — Patrick Durusau @ 7:54 pm

Groonga

From the webpage:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.

The latest release is 1.2.5, released 2011-08-29.

Most of the documentation is in Japanese so I can’t comment on it.

Think of this as an opportunity to (hopefully) learn some Japanese. Given the rate of computer science research in Japan it will not be wasted effort.

PS: If you already read Japanese, feel free to contribute some comments on Groonga.

Federal Register (US)

Filed under: Data Source,Law - Sources — Patrick Durusau @ 7:53 pm

Federal Register (US)

From the developers webpage for the Federal Register (US):

Project Source Code

FederalRegister.gov is a fully open source project; on GitHub you can find the source code for the main site, the chef cookbooks for maintaining the servers, and the WordPress themes and configuration. We welcome your contributions and feedback.

API

While the API is still a work in progress, we’ve designed it to be as easy-to-use as possible:

  • It comes pre-processed; the data provided is a combination of data from the GPO MODS (metadata) files and the GPO bulkdata files and has gone through our cleanup procedures.
  • We’re using JSON as a lighter-weight, more web-friendly data transfer format
  • No API keys are needed; all you need is an HTTP client or browser.
  • The API is fully RESTful; URLs are provided to navigate to the full details or to the next page of results (HATEOAS).
  • A simple JSONP interface is also possible; simply add a `callback=foo` CGI parameter to the end of any URL to have the results be ready for cross-domain JavaScript consumption

See the webpage for Endpoints, Search Functionality, Ruby API Client and Usage Restrictions.
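
A hedged sketch of calling the API from Python with requests; the endpoint path and parameter names below are from memory of the documented v1 API, so check them against the developer pages before relying on them.

    # Query the Federal Register API (no key required); endpoint and parameters
    # are assumptions based on the documented v1 API, verify before use.
    import requests

    resp = requests.get(
        "https://www.federalregister.gov/api/v1/documents.json",
        params={"conditions[term]": "endangered species", "per_page": 5},
        timeout=30,
    )
    resp.raise_for_status()

    data = resp.json()
    print("total matches:", data["count"])
    for doc in data["results"]:
        print(doc["publication_date"], doc["title"])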

For those of you who are unfamiliar with the Federal Register:

The Office of the Federal Register informs citizens of their rights and obligations, documents the actions of Federal agencies, and provides a forum for public participation in the democratic process. Our publications provide access to a wide range of Federal benefits and opportunities for funding and contain comprehensive information about the various activities of the United States Government. In addition, we administer the Electoral College for Presidential elections and the Constitutional amendment process.

The Federal Register is updated daily by 6 a.m. and is published Monday through Friday, except Federal holidays, and consists of four types of entries.

  • Presidential Documents, including Executive orders and proclamations.
  • Rules and Regulations, including policy statements and interpretations of rules.
  • Proposed Rules, including petitions for rulemaking and other advance proposals.
  • Notices, including scheduled hearings and meetings open to the public, grant applications, administrative orders, and other announcements of government actions.

We recommend reading the “Learn” pages of this site for more on the structure and value of the Federal Register and for an overview of the regulatory process.

Or as it says on their homepage: “The Daily Journal of the United States Government.”
