Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 8, 2013

Splunk Enterprise 6

Filed under: Intelligence,Machine Learning,Operations,Splunk — Patrick Durusau @ 3:27 pm

Splunk Enterprise 6

The latest version of Splunk is described as:

Operational Intelligence for Everyone

Splunk Enterprise is the leading platform for real-time operational intelligence. It’s the easy, fast and secure way to search, analyze and visualize the massive streams of machine data generated by your IT systems and technology infrastructure—physical, virtual and in the cloud.

Splunk Enterprise 6 is our latest release and delivers:

  • Powerful analytics for everyone—at amazing speeds
  • Completely redesigned user experience
  • Richer developer environment to easily extend the platform

The current download page offers the Enterprise version free for 60 days. At the end of that period you can convert to a Free license or purchase an Enterprise license.

Elasticsearch Workshop

Filed under: ElasticSearch,JSON — Patrick Durusau @ 3:16 pm

Elasticsearch Workshop by David Pilato.

Nothing startling or new but a good introduction to Elasticsearch that you can pass along to programmers who like JSON. 😉

Nothing against JSON but “efficient” syntaxes are like using 7-bit encodings because they save disk space.
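
If you want something hands-on to pass along with the workshop, here is a minimal sketch of an Elasticsearch match query from Python using the requests library. The index and field names are my own assumptions, not from Pilato's slides; point it at whatever your cluster holds.

```python
import requests

# Hypothetical index and field names; adjust to your own cluster.
ES_URL = "http://localhost:9200/articles/_search"

query = {
    "query": {
        "match": {           # full-text match on a single field
            "title": "topic maps"
        }
    },
    "size": 5                # return the top 5 hits
}

response = requests.post(ES_URL, json=query, timeout=10)
response.raise_for_status()

for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```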

From Algorithms to Z-Scores:…

Filed under: Algorithms,Computer Science,Mathematics,Probability,R,Statistics — Patrick Durusau @ 2:47 pm

From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science by Norm Matloff.

From the Overview:

The materials here form a textbook for a course in mathematical probability and statistics for computer science students. (It would work fine for general students too.)


“Why is this text different from all other texts?”

  • Computer science examples are used throughout, in areas such as: computer networks; data and text mining; computer security; remote sensing; computer performance evaluation; software engineering; data management; etc.
  • The R statistical/data manipulation language is used throughout. Since this is a computer science audience, a greater sophistication in programming can be assumed. It is recommended that my R tutorials be used as a supplement:

  • Throughout the units, mathematical theory and applications are interwoven, with a strong emphasis on modeling: What do probabilistic models really mean, in real-life terms? How does one choose a model? How do we assess the practical usefulness of models?

    For instance, the chapter on continuous random variables begins by explaining that such distributions do not actually exist in the real world, due to the discreteness of our measuring instruments. The continuous model is therefore just that–a model, and indeed a very useful model.

    There is actually an entire chapter on modeling, discussing the tradeoff between accuracy and simplicity of models.

  • There is considerable discussion of the intuition involving probabilistic concepts, and the concepts themselves are defined through intuition. However, all models and so on are described precisely in terms of random variables and distributions.

Another open-source textbook from Norm Matloff!

Algorithms to Z-Scores (the book).

Source files for the book available at: http://heather.cs.ucdavis.edu/~matloff/132/PLN .

Norm suggests his R tutorial, R for Programmers http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf as supplemental reading material.

To illustrate the importance of statistics, Norm gives the following examples in chapter 1:

  • The statistical models used on Wall Street made the “quants” (quantitative analysts) rich—but also contributed to the worldwide financial crash of 2008.
  • In a court trial, large sums of money or the freedom of an accused may hinge on whether the judge and jury understand some statistical evidence presented by one side or the other.
  • Wittingly or unconsciously, you are using probability every time you gamble in a casino— and every time you buy insurance.
  • Statistics is used to determine whether a new medical treatment is safe/effective for you.
  • Statistics is used to flag possible terrorists—but sometimes unfairly singling out innocent people while other times missing ones who really are dangerous.

Mastering the material in this book will put you a long way toward becoming a “statistical skeptic.”

So you can debunk misleading or simply wrong claims by government, industry and special interest groups. Wait! Those are also known as advertisers. Never mind.
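
Since the title promises z-scores, here is the arithmetic in a minimal Python sketch (the sample numbers are made up): a z-score is just how many standard deviations an observation sits from the mean.

```python
from statistics import mean, stdev

# Made-up sample: ten daily counts of something you measured.
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]

mu = mean(sample)       # sample mean
sigma = stdev(sample)   # sample standard deviation

def z_score(x):
    """How many standard deviations x lies from the sample mean."""
    return (x - mu) / sigma

print(f"mean={mu:.2f} sd={sigma:.2f} z(20)={z_score(20):.2f}")
```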

Programming on Parallel Machines

Filed under: Parallel Programming,Parallelism,Programming — Patrick Durusau @ 1:37 pm

Programming on Parallel Machines by Norm Matloff.

From “About This Book:”

Why is this book different from all other parallel programming books? It is aimed more on the practical end of things, in that:

  • There is very little theoretical content, such as O() analysis, maximum theoretical speedup, PRAMs, directed acyclic graphs (DAGs) and so on.
  • Real code is featured throughout.
  • We use the main parallel platforms OpenMP, CUDA and MPI rather than languages that at this stage are largely experimental or arcane.
  • The running performance themes (communications latency, memory/network contention, load balancing and so on) are interleaved throughout the book, discussed in the context of specific platforms or applications.
  • Considerable attention is paid to techniques for debugging.

The main programming language used is C/C++, but some of the code is in R, which has become the pre-eminent language for data analysis. As a scripting language, R can be used for rapid prototyping. In our case here, it enables me to write examples which are much less cluttered than they would be in C/C++, thus easier for students to discern the fundamental parallelization principles involved. For the same reason, it makes it easier for students to write their own parallel code, focusing on those principles. And R has a rich set of parallel libraries.

It is assumed that the student is reasonably adept in programming, and has math background through linear algebra. An appendix reviews the parts of the latter needed for this book. Another appendix presents an overview of various systems issues that arise, such as process scheduling and virtual memory.

It should be noted that most of the code examples in the book are NOT optimized. The primary emphasis is on simplicity and clarity of the techniques and languages used. However, there is plenty of discussion on factors that affect speed, such as cache coherency issues, network delays, GPU memory structures and so on.

Here’s how to get the code files you’ll see in this book: The book is set in LaTeX, and the raw .tex files are available in http://heather.cs.ucdavis.edu/~matloff/158/PLN. Simply download the relevant file (the file names should be clear), then use a text editor to trim to the program code of interest.

In order to illustrate for students the fact that research and teaching (should) enhance each other, I occasionally will make brief references here to some of my research work.

Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples and so on, and of course fix bugs and improve the exposition. For that reason, it is better to link to the latest version, which will always be at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf, rather than to copy it.
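
The book's code is in C/C++ and R, but the prototyping point carries over to any language. As a hedged illustration (mine, not Matloff's), here is a minimal Python sketch that splits a simple computation across worker processes with the standard multiprocessing module, keeping the clutter low the same way he uses R.

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Sum the squares of the integers in [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 4
    step = n // workers
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    chunks[-1] = (chunks[-1][0], n)      # last chunk picks up any remainder

    with Pool(workers) as pool:          # one OS process per worker
        total = sum(pool.map(partial_sum, chunks))

    print(total)
```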

If you don’t mind a bit of practical writing this early in the week, this could be the book for you!

If you read/use the book, please give feedback to the author about any bugs or improvements that can be made.

That is a good way to encourage the production of open-source textbooks.

Search-Aware Product Recommendation in Solr (Users vs. Experts?)

Filed under: Interface Research/Design,Recommendation,Searching,Solr — Patrick Durusau @ 10:43 am

Search-Aware Product Recommendation in Solr by John Berryman.

From the post:

Building upon earlier work with semantic search, OpenSource Connections is excited to unveil exciting new possibilities with Solr-based product recommendation. With this technology, it is now possible to serve user-specific, search-aware product recommendations directly from Solr.

In this post, we will review a simple Search-Aware Recommendation using an online grocery service as an example of e-commerce product recommendation. In this example I have built up a basic keyword search over the product catalog. We’ve also added two fields to Solr: purchasedByTheseUsers and recommendToTheseUsers. Both fields contain lists of userIds. Recall that each document in the index corresponds to a product. Thus the purchasedByTheseUsers field literally lists all of the users who have purchased said product. The next field, recommendToTheseUsers, is the special sauce. This field lists all users who might want to purchase the corresponding product. We have extracted this field using a process called collaborative filtering, which is described in my previous post, Semantic Search With Solr And Python Numpy. With collaborative filtering, we make product recommendation by mathematically identifying similar users (based on products purchased) and then providing recommendations based upon the items that these users have purchased.

Now that the background has been established, let’s look at the results. Here we search for 3 different products using two different, randomly-selected users who we will refer to as Wendy and Dave. For each product: We first perform a raw search to gather a base understanding about how the search performs against user queries. We then search for the intersection of these search results and the products recommended to Wendy. Finally we also search for the intersection of these search results and the products recommended to Dave.
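
A hedged sketch of what that intersection looks like as a Solr request from Python: an ordinary keyword query plus a filter query on the recommendToTheseUsers field described above. The core name, product field and user IDs are my assumptions, not John's actual setup.

```python
import requests

# Hypothetical Solr core; the field names come from the post.
SOLR_SELECT = "http://localhost:8983/solr/groceries/select"

def recommended_results(keywords, user_id, rows=10):
    """Keyword search restricted to products recommended to one user."""
    params = {
        "q": keywords,                                # e.g. "dark chocolate"
        "fq": f"recommendToTheseUsers:{user_id}",     # intersect with recommendations
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get(SOLR_SELECT, params=params, timeout=10)
    resp.raise_for_status()
    # "name" is a hypothetical product-name field.
    return [doc.get("name") for doc in resp.json()["response"]["docs"]]

print(recommended_results("dark chocolate", "wendy"))
```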

BTW, don’t miss the invitation to be an alpha tester for Solr Search-Aware Product Recommendation at the end of John’s post.

Reading John’s post it occurred to me that, as an alternative to mining other users’ choices, you could have an expert develop the recommendations.

Much like we use experts to develop library classification systems.

But we don’t, do we?

Isn’t that interesting?

I suspect we don’t use experts for product recommendations because we know that shopping choices depend on a similarity between consumers.

We may not know what the precise nature of the similarity may be, but it is sufficient that we can establish its existence in the aggregate and sell more products based upon it.

Shouldn’t the same be true for finding information or data?

If similar (in some possibly unknown way) consumers of information find information in similar ways, why don’t we organize information based on similar patterns of finding?

How an “expert” finds information may be more “precise” or “accurate,” but if a user doesn’t follow that path, the user doesn’t find the information.

A great path that doesn’t help users find information is like having a great road with sidewalks, a bike path, crosswalks and good signage that goes nowhere.

How do you incorporate user paths in your topic map application?

October 7, 2013

The NSA As Auto-Immune Disease

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 6:47 pm

Time to tame the NSA behemoth trampling our rights by Yochai Benkler.

From the post:

The spate of new NSA disclosures substantially raises the stakes of this debate. We now know that the intelligence establishment systematically undermines oversight by lying to both Congress and the courts. We know that the NSA infiltrates internet standard-setting processes to weaken security protocols that make surveillance harder. We know that the NSA uses persuasion, subterfuge, and legal coercion to distort software and hardware product design by commercial companies.

We have learned that in pursuit of its bureaucratic mission to obtain signals intelligence in a pervasively networked world, the NSA has mounted a systematic campaign against the foundations of American power: constitutional checks and balances, technological leadership, and market entrepreneurship. The NSA scandal is no longer about privacy, or a particular violation of constitutional or legislative obligations. The American body politic is suffering a severe case of auto-immune disease: our defense system is attacking other critical systems of our body.

The NSA and its fellows are dismantling the U.S. Constitution and American culture in the name of saving us all.

While I doubt the honesty of the contractors and sycophants attached to the intelligence community by its money teat, I am sure there are many government staffers who are completely sincere in their fear of terrorism. Even though it is quite unreasonable.

Gun numbers are always soft but consider:

In 2010, guns took the lives of 31,076 Americans in homicides, suicides and unintentional shootings. This is the equivalent of more than 85 deaths each day and more than three deaths each hour. (Law Center to Prevent Gun Violence)

That’s the equivalent of 10 World Trade Tower bombings every year in terms of casualties. Every year. 9/11 x 10 from gun violence.

But it’s true, we were attacked. And our immune system responded completely disproportionately to the attack. We destabilized two countries, lost more in casualties than in the original attack, repealed most of our bill of rights, etc.

Now more than twelve (12) years later our intelligence services are still jumping at shadows and pressing for more security measures and fewer rights.

No one can prove that terrorists aren’t “out there,” but their remarkable lack of success is some indication that it isn’t a serious problem.

Consider the recent mall attack in Kenya.

First, can you find Kenya on a map? Anywhere close to the United States? I didn’t think so.

Second, and the casualties? More than sixty (60) dead? That’s less than four days of gun deaths in the United States.

I’m sorry the people in Kenya are dead but I am also sorry about the ongoing gun casualties in the United States.

If and when terrorism becomes a serious problem, then we can look for solutions.

Looking for solutions to fantasy attackers is a sure recipe for national bankruptcy and ruin.

New Graph Visualisation [ArangoDB]

Filed under: ArangoDB,Graphs — Patrick Durusau @ 4:08 pm

New Graph Visualisation

From the post:

Are you storing Graphs in ArangoDB?
Ever wondered how exactly your data looks like?
Do you want to visually explore and maintain your Graph data?

We have a solution for you!

With ArangoDB 1.4 we ship an additional tab in the Administration Interface called “Graphs”. In this tab you can load your graph data and start exploring it visually. After you have defined the collections where your data is stored you can search for a start vertex by defining an attribute-value pair contained in it, or you can start from any random vertex. The new interface will then offer you the means to explore the graph by loading child vertices (SPOT) and you can configure the interface to display labels on your nodes and to draw nodes with different content in different colors as shown in the screenshot.

New visualization capabilities are always welcome!

The screen shots are impressive. How does it work for you?

From The Front Lines of ADASS XXIII

Filed under: Astroinformatics,BigData — Patrick Durusau @ 4:04 pm

Bruce Berriman has been posting summaries of the 23rd annual Astronomical Data Analysis Software and Systems (ADASS) conference, which is being held September 29th through October 3rd in Waikoloa, Hawaii.

From The Front Lines of ADASS XXIII – Day One

From The Front Lines of ADASS XXIII – Day Two Morning

From The Front Lines of ADASS – Day 3 Morning

More posts to appear here.

The program with links to abstracts.

It wasn’t called “big data” back in the day but astronomers were early users of what is now called “big data.”

Enjoy!

The IMS Open Corpus Workbench (CWB)

Filed under: Corpora,Corpus Linguistics,Software — Patrick Durusau @ 3:46 pm

The IMS Open Corpus Workbench (CWB)

From the webpage:

The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

The first official open-source release of the Corpus Workbench (Version 3.0) is now available from this website. While many pages are still under construction, you can download release versions of the CWB, associated software and sample corpora. You will also find some documentation and other information in the different sections of this site.

If you are investigating large amounts of text, this may be the tool for you.

BTW, don’t miss: Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium by Stefan Evert and Andrew Hardie.

Abstract:

Corpus Workbench (CWB) is a widely-used architecture for corpus analysis, originally designed at the IMS, University of Stuttgart (Christ 1994). It consists of a set of tools for indexing, managing and querying very large corpora with multiple layers of word-level annotation. CWB’s central component is the Corpus Query Processor (CQP), an extremely powerful and efficient concordance system implementing a flexible two-level search language that allows complex query patterns to be specified both at the level of an individual word or annotation, and at the level of a fully- or partially-specified pattern of tokens. CWB and CQP are commonly used as the back-end for web-based corpus interfaces, for example, in the popular BNCweb interface to the British National Corpus (Hoffmann et al. 2008). CWB has influenced other tools, such as the Manatee software used in SketchEngine, which implements the same query language (Kilgarriff et al. 2004).

This paper details recent work to update CWB for the new century. Perhaps the most significant development is that CWB version 3 is now an open source project, licensed under the GNU General Public Licence. This change has substantially enlarged the community of developers and users and has enabled us to leverage existing open-source libraries in extending CWB’s capabilities. As a result, several key improvements were made to the CWB core: (i) support for multiple character sets, most especially Unicode (in the form of UTF-8), allowing all the world’s writing systems to be utilised within a CWB-indexed corpus; (ii) support for powerful Perl-style regular expressions in CQP queries, based on the open-source PCRE library; (iii) support for a wider range of OS platforms including Mac OS X, Linux, and Windows; and (iv) support for larger corpus sizes of up to 2 billion words on 64-bit platforms.

Outside the CWB core, a key concern is the user-friendliness of the interface. CQP itself can be daunting for beginners. However, it is common for access to CQP queries to be provided via a web-interface, supported in CWB version 3 by several Perl modules that give easy access to different facets of CWB/CQP functionality. The CQPweb front-end (Hardie forthcoming) has now been adopted as an integral component of CWB. CQPweb provides analysis options beyond concordancing (such as collocations, frequency lists, and keywords) by using a MySQL database alongside CQP. Available in both the Perl interface and CQPweb is the Common Elementary Query Language (CEQL), a simple-syntax set of search patterns and wildcards which puts much of the power of CQP in a form accessible to beginning students and non-corpus-linguists.

The paper concludes with a roadmap for future development of the CWB (version 4 and above), with a focus on even larger corpora, full support for XML and dependency annotation, new types of query languages, and improved efficiency of complex CQP queries. All interested users are invited to help us shape the future of CWB by discussing requirements and contributing to the implementation of these features.

I have been using some commercial concordance software recently on standards drafts.

I need to give the IMS Open Corpus Workbench (CWB) a spin.

I would not worry about the 2 billion word corpus limitation.

That’s approximately 3,333.33 times the number of words in War and Peace by Leo Tolstoy. (I rounded the English translation word count up to 600,000 for an even number.)

Jump-start your data pipelining into Google BigQuery

Filed under: ETL,Google BigQuery,Google Compute Engine — Patrick Durusau @ 3:18 pm

Like they said at Woodstock, “if you don’t think ETL is all that weird,” wait, wasn’t that, “if you don’t think capitalism is all that weird?”

Maybe, maybe not. But in any event, Wally Yau has written guidance on getting the Google Compute Engine up and ready to do some ETL in Jump-start your data pipelining into Google BigQuery.

Or if you already have “cooked” data, there is another sample application, Automated File Loader for BigQuery, which shows how to load data that will produce your desired results.

Both of these are from: Getting Started with Google BigQuery.
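
For the “cooked data” case, the heart of a loader like the sample application is a load job. Here is a minimal sketch using the current google-cloud-bigquery Python client (this is not Yau's sample code; the project, dataset, table and bucket names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment

# Hypothetical destination table and source file.
table_id = "my-project.my_dataset.clickstream"
source_uri = "gs://my-bucket/clickstream-2013-10-07.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()          # block until the load job finishes

print(client.get_table(table_id).num_rows, "rows loaded")
```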

You do know that Google is located in the United States?

A DataViz Book Trifecta

Filed under: Graphics,Visualization — Patrick Durusau @ 2:58 pm

A DataViz Book Trifecta by Ben Jones.

Ben gives a quick overview of (and reason to read):

Creating More Effective Graphs – Naomi Robbins (2013)

The Functional Art – Alberto Cairo (2012)

Beautiful Visualization – Edited by Steele and Iliinsky (2010)

It’s never too early to start adding to your gift list. 😉

Markov Chains in Neo4j

Filed under: Graphs,Markov Decision Processes,Mathematics,Neo4j — Patrick Durusau @ 2:41 pm

Markov Chains in Neo4j by Nicole White.

From the post:

My new favorite thing lately is Neo4j, a graph database. It’s simple yet powerful: a graph database contains nodes and relationships, each which have properties. I recently made this submission to Neo4j’s GraphGist Challenge, which I did pretty well in.

After discovering Neo4j and graph databases a little over a month and a half ago, I’ve become subject to this weird syndrome where I think to myself, “Could I put that into a graph database?” with literally everything I encounter. The answer is usually yes.

Markov Chains

I realized the other day that nodes can have relationships with themselves, and for some reason, this immediately reminded me of Markov chains. The term Markov chain sounds intimidating at first (it did to me when I first saw the term on a syllabus), but they’re actually pretty simple: Markov chains consist of states and probabilities. The number of possible states is finite, and the Markov chain is a stochastic process that transitions, with certain probabilities, from one state to another over what I like to call time-steps.

The most important property of a Markov chain is that it is memoryless; that is, the probability of entering the next state depends only on the current state. We don’t care about where the process has been, only about where it is now.

If you wander over to the Wikipedia page on Markov chains, you’ll see pretty quickly why they are an obvious candidate for a graph database. The main profile picture for the page shows a Markov chain in graph form, where the states are nodes and the probabilities of transitioning from one state to another are the relationships between those nodes. The reason my realization mentioned earlier was important is that there is often a non-zero probability, given a Markov chain is in state A, that it will ‘enter’ state A in the next time-step. This is represented by a node that has a relationship with itself.

Interesting use of Neo4j to create a transition model.

Curious what you think of Nicole’s use of queries to avoid matrix multiplication?

It works but how often do you want to know the probability of one element in one state of a system?

Or would you extend the one element probability query to query more elements in a particular state?
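
For comparison, the matrix-multiplication approach that Nicole's queries sidestep is only a few lines of Python/numpy. A sketch with a made-up three-state chain (not her dataset):

```python
import numpy as np

# Made-up transition matrix: P[i, j] = probability of moving from state i to state j.
# Note the non-zero diagonal entries: a state can transition to itself.
P = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
])

start = np.array([1.0, 0.0, 0.0])   # begin in state 0 with certainty

# Distribution over all states after n time-steps: start @ P^n
n = 5
after_n = start @ np.linalg.matrix_power(P, n)
print(after_n)

# The query-style question ("probability of one element being in one state")
# is just a single entry of this vector, e.g. after_n[2].
```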

Webinar: Turbo-Charging Solr

Filed under: Entity Resolution,Lucene,LucidWorks,Relevance,Solr — Patrick Durusau @ 10:40 am

Turbo-charge your Solr instance with Entity Recognition, Business Rules and a Relevancy Workbench by Yann Yu.

Date: Thursday, October 17, 2013
Time: 10:00am Pacific Time

From the post:

LucidWorks has three new modules available in the Solr Marketplace that run on top of your existing Solr or LucidWorks Search instance. Join us for an overview of each module and learn how implementing one, two or all three will turbo-charge your Solr instance.

  • Business Rules Engine: Out of the box integration with Drools, the popular open-source business rules engine, is now available for Solr and LucidWorks Search. With the LucidWorks Business Rules module, developers can write complex rules using declarative syntax with very little programming. Data can be modified, cleaned and enriched through multiple permutations and combinations.
  • Relevancy Workbench: Experiment with different search parameters to understand the impact of these changes to search results. With intuitive, color-coded and side-by-side comparisons of results for different sets of parameters, users can quickly tune their application to produce the results they need. The Relevancy Workbench encourages experimentation with a visual “before and after” view of the results of parameter changes.
  • Entity Recognition: Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc.

All of these modules will be of interest to topic mappers who are processing bulk data.

Hortonworks Sandbox – Default Instructional Tool?

Filed under: BigData,Eclipse,Hadoop,Hortonworks,Visualization — Patrick Durusau @ 10:07 am

Visualizing Big Data: Actuate, Hortonworks and BIRT

From the post:

Challenge

Hadoop stores data in key-value pairs. While the raw data is accessible to view, to be usable it needs to be presented in a more intuitive visualization format that will allow users to glean insights at a glance. While a business analytics tool can help business users gather those insights, to do so effectively requires a robust platform that can:

  • Work with expansive volumes of data
  • Offer standard and advanced visualizations, which can be delivered as reports, dashboards or scorecards
  • Be scalable to deliver these visualizations to a large number of users

Solution

When paired with Hortonworks, Actuate adds data visualization support for the Hadoop platform, using Hive queries to access data from Hortonworks. Actuate’s commercial product suite – built on open source Eclipse BIRT – extracts data from Hadoop, pulling data sets into interactive BIRT charts, dashboards and scorecards, allowing users to view and analyze data (see diagram below). With Actuate’s familiar approach to presenting information in easily modified charts and graphs, users can quickly identify patterns, resolve business issues and discover opportunities through personalized insights. This is further enhanced by Actuate’s inherent ability to combine Hadoop data with more traditional data sources in a single visualization screen or dashboard.

A BIRT/Hortonworks “Sandbox” for both the Eclipse open source and commercial versions of BIRT is now available. As a full HDP environment on a virtual machine, the Sandbox allows users to start benefiting quickly from Hortonworks’ distribution of Hadoop with BIRT functionality.

If you know about “big data” you should be familiar with the Hortonworks Sandbox.

Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

What you may not know is that Hortonworks partners are creating additional tutorials based on the sandbox.

I count seven (7) to date and more are coming.

The Sandbox may become the default instructional tool for Hadoop.

That would be a benefit to all users, whatever the particulars of their environments.

The first GraphGist Challenge completed

Filed under: Graphs,Neo4j — Patrick Durusau @ 9:48 am

The first GraphGist Challenge completed by Anders Nawroth.

From the post:

We’re happy to announce the results of the first GraphGist challenge.

First of all, we want to thank all participants for their great contributions. We were blown away by the high quality of the contributions. Everyone has put in a lot of time and effort, providing thoughtful, interesting and well explained data models and Cypher queries. There was also great use of graphics, including use of the Arrows tool.

We thought we had high expectations, but the contributions still exceeded them by far. In this sense, everyone is a winner, and we look forward to sending out a cool Neo4j t-shirt and Graph Connect ticket or a copy of the Graph Databases book to all participants. And for the same reason, we strongly advise you to go have a look at all submissions.

The winners:

At third place, we find Chess Games and Positions by Wes Freeman. He makes it all sound very simple:

Learning Graph by Johannes Mockenhaupt comes in at second place. Here’s his own introduction to it:

The US Flights & Airports contribution from Nicole White finished first in this challenge. Congrats Nicole!

….

The near future:

If you want to have a look at the GraphGist project, it’s located here: https://github.com/neo4j-contrib/graphgist. It’s a client-side only browser-based application. Meaning, it’s basically a bunch of Javascript files. We’d be happy to see Pull Requests for the project. Please note that you can contribute styling or documentation (as a GraphGist), not only Javascript code!

We already got questions about the next GraphGist challenge. Our plan is to run the next challenge around the time Neo4j 2.0 gets released. Currently we think that will mean a closing date before Christmas. We’ll keep you posted when we know more.

Anders provides great descriptions of the winners but see their entries for full details.

For that matter, see all the entries. The breadth of applications may surprise you.

Even if not, it will be good preparation for the next GraphGist challenge!

October 6, 2013

If it doesn’t work on mobile, it doesn’t work

Filed under: Graphics,Interface Research/Design,Topic Maps,Visualization — Patrick Durusau @ 7:30 pm

If it doesn’t work on mobile, it doesn’t work by Brian Boyer.

Brian’s notes from a presentation at Hacks/Hackers Buenos Aires last August.

The presentation is full of user survey results and statistics that are important for topic map interface designers.

At least if you want to be a successful topic map interface designer.

Curious, do you think consuming topic map based information will require a different interface from generic information consumption?

Reasoning that a consumer of information may not know or even care what technology underlies the presentation of desired information.

Would your response differ if I asked about authoring topic map content?

To simplify that question, let’s assume that we aren’t talking about a generic topic map authoring interface.

Say a baseball topic map authoring interface that accepts player’s name, positions, actions in games, etc., without exposing topic map machinery.

Apache Solr 4.5 documentation

Filed under: Lucene,Solr — Patrick Durusau @ 4:22 pm

Apache Solr 4.5 documentation

From the post:

Apache Solr PMC announced that the newest version of official Apache Solr documentation for Solr 4.5 (more about that version) is now available. The PDF file with documentation is available at: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/.

If Apache Solr 4.5 was welcome news, this is even more so!

I am doing a lot of proofing of drafts (not by me) this week. Always refreshing to have alternative reading material that doesn’t make me wince.

That’s unfair. To the Apache Solr Reference Manual.

It is way better than simply not making me wince.

I am sure I will find things I would state differently but I feel confident I won’t encounter writing errors we were encouraged since grade school to avoid.

I won’t go into the details as someone might mistake description for recommendation. 😉

Enjoy the Apache Solr 4.5 documentation!

October 5, 2013

Apache Lucene 4.5 and Apache SolrTM 4.5 available

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

From: Apache Lucene News:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.5 and Apache Solr 4.5.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • Added support for missing values to DocValues fields through AtomicReader.getDocsWithField.
  • Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues, supporting missing values and with most datastructures residing off-heap.
  • New in-memory DocIdSet implementations which are especially better than FixedBitSet on small sets: WAH8DocIdSet, PFORDeltaDocIdSet and EliasFanoDocIdSet.
  • CachingWrapperFilter now caches filters with WAH8DocIdSet by default, which has the same memory usage as FixedBitSet in the worst case but is smaller and faster on small sets.
  • TokenStreams now set the position increment in end(), so we can handle trailing holes.
  • IndexWriter no longer clones the given IndexWriterConfig.

Lucene 4.5 also includes numerous optimizations and bugfixes.

Highlights of the Solr release include:

  • Custom sharding support, including the ability to shard by field.
  • DocValue improvements: single valued fields no longer require a default value, allowing dynamicFields to contain doc values, as well as sortMissingFirst and sortMissingLast on docValue fields.
  • Ability to store solr.xml in ZooKeeper.
  • Multithreaded faceting.
  • CloudSolrServer can now route updates directly to the appropriate shard leader.

Solr 4.5 also includes numerous optimizations and bugfixes.

Excellent!

How to: Scaling FoundationDB

Filed under: FoundationDB — Patrick Durusau @ 6:57 pm

How to: Scaling FoundationDB by Ben Collins.

From the post:

We put together this screencast to walk through how to install and scale your FoundationDB cluster beyond the default single-machine installation. In it, we explain how to add two more machines and configure the cluster to take advantage of the high performance and fault tolerance properties of FoundationDB.

I need to get some medium sized boxes for experimental purposes.

Winter is approaching and they would reduce the amount of time I have to run the heater. 😉

October 4, 2013

cypher-mode

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 4:42 pm

cypher-mode

From the webpage:

Emacs major mode for editing cypher scripts (Neo4j).

First *.el upload today. Could be interesting.

The Mathematical Shape of Things to Come

Filed under: Data Analysis,Mathematics,Modeling — Patrick Durusau @ 4:27 pm

The Mathematical Shape of Things to Come by Jennifer Ouellette.

From the post:

Simon DeDeo, a research fellow in applied mathematics and complex systems at the Santa Fe Institute, had a problem. He was collaborating on a new project analyzing 300 years’ worth of data from the archives of London’s Old Bailey, the central criminal court of England and Wales. Granted, there was clean data in the usual straightforward Excel spreadsheet format, including such variables as indictment, verdict, and sentence for each case. But there were also full court transcripts, containing some 10 million words recorded during just under 200,000 trials.

“How the hell do you analyze that data?” DeDeo wondered. It wasn’t the size of the data set that was daunting; by big data standards, the size was quite manageable. It was the sheer complexity and lack of formal structure that posed a problem. This “big data” looked nothing like the kinds of traditional data sets the former physicist would have encountered earlier in his career, when the research paradigm involved forming a hypothesis, deciding precisely what one wished to measure, then building an apparatus to make that measurement as accurately as possible.

From further in the post:

Today’s big data is noisy, unstructured, and dynamic rather than static. It may also be corrupted or incomplete. “We think of data as being comprised of vectors – a string of numbers and coordinates,” said Jesse Johnson, a mathematician at Oklahoma State University. But data from Twitter or Facebook, or the trial archives of the Old Bailey, look nothing like that, which means researchers need new mathematical tools in order to glean useful information from the data sets. “Either you need a more sophisticated way to translate it into vectors, or you need to come up with a more generalized way of analyzing it,” Johnson said.
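
To make the “data as vectors” point concrete, here is a minimal bag-of-words sketch in Python (toy sentences, not Old Bailey transcripts): each document becomes a vector of word counts over a shared vocabulary.

```python
from collections import Counter

# Toy documents standing in for court transcripts or tweets.
docs = [
    "the prisoner stole a silver watch",
    "the jury found the prisoner guilty",
    "the watch was returned to its owner",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})

def to_vector(words):
    """Count-vector of one document over the shared vocabulary."""
    counts = Counter(words)
    return [counts[term] for term in vocab]

vectors = [to_vector(words) for words in tokenized]
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```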

All true but vectors expect a precision that is missing from any natural language semantic.

A semantic that varies from listener to listener. See: Is there a text in this class? : The authority of interpretive communities by Stanley Fish.

It is a delightful article, so long as one bears in mind that all representations of semantics are from a point of view.

The most we can say for any point of view is that it is useful for some stated purpose.

DocGraph – Neo4j Version

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:01 pm

DocGraph – Neo4j Version by Max De Marzi.

Max has created a Neo4j version of the DocGraph dataset.

Enjoy!

The Irony of Obamacare:…

Filed under: Politics,Text Analytics,Text Mining — Patrick Durusau @ 3:49 pm

The Irony of Obamacare: Republicans Thought of It First by Meghan Foley.

From the post:

“An irony of the Patient Protection and Affordable Care Act (Obamacare) is that one of its key provisions, the individual insurance mandate, has conservative origins. In Congress, the requirement that individuals purchase health insurance first emerged in Republican health care reform bills introduced in 1993 as alternatives to the Clinton plan. The mandate was also a prominent feature of the Massachusetts plan passed under Governor Mitt Romney in 2006. According to Romney, ‘we got the idea of an individual mandate from [Newt Gingrich], and [Newt] got it from the Heritage Foundation.’” – Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach

That irony led John Wilkerson of the University of Washington and his colleagues David Smith and Nick Stramp to study the legislative history of the health care reform law using a text-analysis system to understand its origins.

Scholars rely almost exclusively on floor roll call voting patterns to assess partisan cooperation in Congress, according to findings in the paper, Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach. By that standard, the Affordable Care Act was a highly partisan bill. Yet a different story emerges when the source of the reform’s policy is analyzed. The authors’ findings showed that a number of GOP policy ideas overlap with provisions in the Affordable Care Act: Of the 906-page law, 3 percent of the “policy ideas” used wording similar to bills sponsored by House Republicans and 8 percent used wording similar to bills sponsored by Senate Republicans.

In the paper, the authors say:

Our approach is to focus on legislative text. We assume that two bills share a policy idea when they share similar text. Of course, this raises many questions about whether similar text does actually capture shared policy ideas. This paper constitutes an early cut at the question.

The same thinking, similar text = similar ideas, permeates prior art searches on patents as well.

A more fruitful search would be of donor statements, proposals, literature for similar language/ideas.

In that regard, members of the United States Congress are just messengers.

PS: Thanks to Sam Hunting for the pointer to this article!

Does the NSA Not Share With The FBI?

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 3:17 pm

You may have heard the story on NPR: How Snowden’s Email Provider Tried To Foil The FBI Using Tiny Font by Eyder Peralta.

From the story:

Right before Edward Snowden told the world it was he who had leaked information about some of the government’s most secret surveillance programs, the FBI was hot on his trail.

One of the places agents looked was Lavabit, the company that hosted Snowden’s email account. As we told you back in August, Lavabit shuttered its service, saying it could not say why because a government gag order was issued.

The insinuation was clear: Lavabit had been served with a so-called national security letter, in which the FBI demands information about a user, but the service provider isn’t allowed to tell the user or anyone else that it was even asked about this.

On Wednesday, the documents filed (pdf) with the United States District Court for the Eastern District of Virginia were unsealed. Even though Snowden’s name is redacted, we can surmise this concerns him because of the charges against the person and the timing of the investigation. Perhaps more importantly, however, the documents give us detail of what’s usually a secret process. And it illuminates how Ladar Levison, Lavabit’s owner, tried to fight the government’s request for information on one of his users and then a subsequent request for an encryption key that would allow agents to read the communication of all its users.

After a bizarre back and forth, Ladar Levison was forced to give up the encryption key.

What puzzles me is why ask for the encryption key at all? Can’t the FBI ask the NSA to use its $billions in hardware/software and encryption standard corruption skills just to decrypt Snowden‘s emails?

Or on its own, can the FBI only use non-scrambled phone communications and plain text emails?

If that is the case, then something as simple as ROT13 will defeat the FBI.
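
ROT13 really is that simple; the Python standard library even ships it as a codec. A two-line illustration (not a recommendation for actual secrecy):

```python
import codecs

message = "meet me at the usual place"
scrambled = codecs.encode(message, "rot_13")   # rotate each letter 13 places
print(scrambled)                                # zrrg zr ng gur hfhny cynpr
print(codecs.decode(scrambled, "rot_13"))       # applying ROT13 twice restores the text
```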

To clue the FBI in on why sharing with the NSA is a one-way relationship: the NSA puts the FBI, the Army, Navy, Air Force, Marines, etc., all in the “semi-adversary” camp.

They don’t trust other government departments any more than they trust U.S. citizens.

Just remember the next time the Director of National Intelligence says “adversary,” to the NSA, that includes your name along with all other non-NSA groups, organizations, departments and people.

If the FBI won’t intervene to save us, maybe it will intervene to save itself.

Towards OLAP in Graph Databases (MSc. Thesis)

Filed under: Graphs,Neo4j — Patrick Durusau @ 1:40 pm

Towards OLAP in Graph Databases (MSc. Thesis) by Michal Bachman.

Abstract:

Graph databases are becoming increasingly popular as an alternative to relational databases for managing complex, densely-connected, semi-structured data. Whilst primarily optimised for online transactional processing, graph databases would greatly benefit from online analytical processing capabilities. Since relational databases were introduced over four decades ago, they have acquired online analytical processing facilities; this is not the case with graph databases, which have only drawn mainstream attention in the past few years.

In this project, we study the problem of online analytical processing in graph databases that use the property graph data model, which is a graph with properties attached to both vertices and edges. We use vertex degree analysis as a simple example problem, create a formal definition of vertex degree in a property graph, and develop a theoretical vertex degree cache with constant space and read time complexity, enabled by a cache compaction operation and a property change frequency heuristic.

We then apply the theory to Neo4j, an open-source property graph database, by developing a Relationship Count Module, which implements the theoretical vertex degree caching. We also design and implement a framework, called GraphAware, which provides supporting functionality for the module and serves as a platform for additional development, particularly of modules that store and maintain graph metadata.

Finally, we show that for certain use cases, for example those in which vertices have relatively high degrees and edges are created in separate transactions, vertex degree analysis can be performed several orders of magnitude faster, whilst sacrificing less than 20% of the write throughput, when using GraphAware Framework with the Relationship Count Module.

By demonstrating the extent of possible performance improvements, exposing the true complexity of a seemingly simple problem, and providing a starting point for future analysis and module development, we take an important step towards online analytical processing in graph databases.
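
The thesis itself is about Neo4j internals, but the core caching idea is easy to see in a toy form. A hedged Python sketch (mine, not GraphAware code): keep a per-vertex relationship count that is maintained on writes, so degree reads never have to walk the edges.

```python
from collections import defaultdict

class DegreeCachedGraph:
    """Toy adjacency store with a vertex-degree cache maintained on write."""

    def __init__(self):
        self.edges = defaultdict(set)     # vertex -> set of neighbours
        self.degree = defaultdict(int)    # cached relationship counts

    def add_edge(self, a, b):
        if b not in self.edges[a]:
            self.edges[a].add(b)
            self.edges[b].add(a)
            self.degree[a] += 1           # pay a little on write...
            self.degree[b] += 1

    def vertex_degree(self, v):
        return self.degree[v]             # ...to read the degree in constant time

g = DegreeCachedGraph()
g.add_edge("alice", "bob")
g.add_edge("alice", "carol")
print(g.vertex_degree("alice"))           # 2
```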

The MSc. thesis: GraphAware: Towards Online Analytical Processing in Graph Databases.

Framework at Github: GraphAware Neo4j Framework.

Michal laments:

It’s not an easy, cover-to-cover read, but there might be some interesting parts, even if you don’t go through all the (over 100) pages.

It’s one hundred and forty-nine pages according to my PDF viewer.

I don’t think Michal needs to worry. If anyone thinks it is too long to read, it’s their loss.

Definitely going on my short list of things to read in detail sooner rather than later.

October 3, 2013

Easy k-NN Document Classification with Solr and Python

Filed under: K-Nearest-Neighbors,Python,Solr — Patrick Durusau @ 7:02 pm

Easy k-NN Document Classification with Solr and Python by John Berryman.

From the post:

You’ve got a problem: You have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However you are fortunate enough to have several thousand documents that have already been tagged. So why not…

Build a k-Nearest Neighbors Classifier!

The concept of a k-NN document classifier is actually quite simple. Basically, given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have the same tag as that which was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any opensource products that are really good at similarity-based document retrieval? That’s right, Solr! Basically, given a new input document, all we have to do is scoop out the “statistically interesting” terms, submit a search composed of these terms, and count the tags that come back. And it even turns out that Solr takes care of identifying the “statistically interesting” terms. All we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts “Goldilocks” terms – those terms that are not too long, not too short, not too common, and not too rare… they’re all just right.
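
A hedged sketch of that loop from Python: post the untagged document to Solr's MoreLikeThis handler, then vote with the tags of the neighbours that come back. The core name, field names and /mlt handler configuration are my assumptions (stream.body also requires remote streaming to be enabled), so treat this as the shape of the approach rather than drop-in code.

```python
import requests
from collections import Counter

# Assumed Solr core with an /mlt handler over a "body" field and a
# multi-valued "tag" field on the already-tagged documents.
MLT_URL = "http://localhost:8983/solr/docs/mlt"

def classify(document_text, k=10):
    """Return the most common tag among the k most similar tagged documents."""
    params = {
        "stream.body": document_text,   # send the raw text of the new document
        "mlt.fl": "body",               # field to mine for "interesting" terms
        "fl": "tag",                    # we only need the neighbours' tags back
        "rows": k,
        "wt": "json",
    }
    resp = requests.post(MLT_URL, data=params, timeout=10)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]

    votes = Counter(tag for doc in docs for tag in doc.get("tag", []))
    return votes.most_common(1)[0][0] if votes else None

print(classify("open source search engines built on lucene ..."))
```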

I don’t know how timely John’s post is for you but it is very timely for me. 😉

I was being asked yesterday about devising a rough cut over a body of texts.

Looking forward to putting this approach through its paces.

Dublin Lucene Revolution 2013 Sessions

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:45 pm

Dublin Lucene Revolution 2013 Sessions

Just a sampling to whet your appetite:

With many more entries in the intermediate and introductory levels.

Of all of the listed sessions, which ones will set your sights on Dublin?

Reminder: Training: November 4-5, Conference: November 6-7

October 2, 2013

Big Data vs. Relevant Data

Filed under: BigData — Patrick Durusau @ 6:06 pm

At 50 Powerful Statistics About Tech Mega Trends Affecting Every Business from ValaAfshar, I ran across one of those odd factoids that will be repeated.

2.5 billion gigabytes (2.5 Exabytes) of data are created every day. That number doubles every month. -IBM

I don’t know that anyone at IBM actually said that but it sounds like an IBM-like statement. 😉

Admittedly, that’s a lot of data.

But don’t go running out to subscribe to one of those $299/year technology newsletters because a lot of data is created everyday.

Think about all the sources of data and how many of them are relevant to you.

Traffic cameras and traffic lights are generating data, 24×7.

Rental car companies are generating data, including tracking data on cars.

Cell phone services are generating data, including data to pinpoint your location for the NSA. (Alleged to be stopped. Yeah, like they were not supposed to be spying on U.S. citizens. Right.)

Not to mention ATMs, weather data (down for the government shutdown), space data (also closed for government shutdown), plus a number of other data sources you could name off the top of your head.

All that data is included in the “2.5 billion gigabytes (2.5 Exabytes)” quoted above.

You may impress your brother-in-law by knowing that factoid but it’s not much use in planning an IT strategy.

A better starting place would be to ask, “What data is relevant to my product or market?”

Make out a list of data that is relevant today, may be relevant in a month and in a year.

With that beginning, you can start to measure the benefit of adding data versus the cost of adding it.

You may get to the “big data” stage but unlike some ventures, you will be making money along the way.

No Free Speech for Tech Firms?

Filed under: Law,NSA,Privacy,Security — Patrick Durusau @ 4:16 pm

I stumbled across Tech firms’ release of PRISM data will harm security — new U.S. and FBI court filings by Jeff John Roberts today.

From the post:

The Department of Justice, in long-awaited court filings that have just been released, urged America’s secret spy court to reject a plea by five major tech companies to disclose data about how often the government asks for user information under a controversial surveillance program aimed at foreign suspects.

The filings, which appeared on Wednesday, claimed that the tech companies – Google, Microsoft, Facebook, LinkedIn and Yahoo — do not have a First Amendment right to disclose how many Foreign Intelligence Surveillance Act requests they receive.

“Adversaries may alter their behavior by switching to service that the Government is not intercepting,” said the filings, which are heavily blacked out and cite Edward Snowden, a former NSA contractor. Snowden has caused an ongoing stir by leaking documents about a U.S. government program known as PRISM that vacuums up meta-data from the technology firms.

I thought we had settled the First Amendment for corporations back in Citizens United v. FEC.

Justice Kennedy, writing for the majority said:

The censorship we now confront is vast in its reach. The Government has “muffle[d] the voices that best represent the most significant segments of the economy.” Mc Connell, supra, at 257–258 (opinion of Scalia, J.). And “the electorate [has been] deprived of information, knowledge and opinion vital to its function.” CIO, 335 U. S., at 144 (Rutledge, J., concurring in result). By suppressing the speech of manifold corporations, both for-profit and nonprofit, the Government prevents their voices and viewpoints from reaching the public and advising voters on which persons or entities are hostile to their interests. Factions will necessarily form in our Republic, but the remedy of “destroying the liberty” of some factions is “worse than the disease.” The Federalist No. 10, p. 130 (B. Wright ed. 1961) (J. Madison). Factions should be checked by permitting them all to speak, see ibid., and by entrusting the people to judge what is true and what is false.

The purpose and effect of this law is to prevent corporations, including small and nonprofit corporations, from presenting both facts and opinions to the public. This makes Austin ’s antidistortion rationale all the more an aberration. “[T]he
First Amendment protects the right of corporations to petition legislative and administrative bodies.” Bellotti, 435 U. S., at 792, n. 31 (citing California Motor Transport Co. v. Trucking Unlimited, 404 U. S. 508, 510–511 (1972); Eastern Railroad Presidents Conference v. Noerr Motor Freight, Inc., 365 U. S. 127, 137–138 (1961)). Corporate executives and employees counsel Members of Congress and Presidential administrations on many issues, as a matter of routine and often in private. An amici brief filed on behalf of Montana and 25 other States notes that lobbying and corporate communications with elected officials occur on a regular basis. Brief for State of Montana et al. as Amici Curiae 19. When that phenomenon is coupled with §441b, the result is that smaller or nonprofit corporations cannot raise a voice to object when other corporations, including those with vast wealth, are cooperating with the Government. That cooperation may sometimes be voluntary, or it may be at the demand of a Government official who uses his or her authority, influence, and power to threaten corporations to support the Government’s policies. Those kinds of interactions are often unknown and unseen. The speech that §441b forbids, though, is public, and all can judge its content and purpose. References to massive corporate treasuries should not mask the real operation of this law. Rhetoric ought not obscure reality.

I admit that Citizens United v. FEC was about corporations buying elections but Justice Department censorship in this case is even worse.

Censorship in this case strikes at trust in products and services from: Google, Microsoft, Facebook, LinkedIn, Yahoo and Dropbox.

And it prevents consumers from making their own choices about who or what to trust.

Google, Microsoft, Facebook, LinkedIn, Yahoo and Dropbox should publish all the details of FISA requests.

Trust your customers/citizens to make the right choice.

PS: If Fortune 50 companies don’t have free speech, what do you think you have?

ExpressionBlast:… [Value of Mapping and Interchanging Mappings]

Filed under: Genomics,Merging,Topic Maps — Patrick Durusau @ 3:25 pm

ExpressionBlast: mining large, unstructured expression databases by Guy E Zinman, Shoshana Naiman, Yariv Kanfi, Haim Cohen and Ziv Bar-Joseph. (Nature Methods 10, 925–926 (2013))

From a letter to the editor:

To the Editor: The amount of gene expression data deposited in public repositories has grown exponentially over the last decade (Supplementary Fig. 1). Specifically, Gene Expression Omnibus (GEO)1 is one of largest expression-data repositories (Supplementary Table 1), containing hundreds of thousands of microarray and RNA-seq experiment results grouped into tens of thousands of series. Although accessible, data deposited in GEO are not well organized. Even among data sets for a single species there are many different platforms with different probe identifiers, different value scales and very limited annotations of the condition profiled by each array. Current methods for using GEO data to study signaling and other cellular networks either do not scale or cannot fully use the available information (Supplementary Table 2 and Supplementary Results).

To enable queries of such large expression databases, we developed ExpressionBlast (http://www.expression.cs.cmu.edu/): a computational method that uses automated text analysis to identify and merge replicates and determine the type of each array in the series (treatment or control; Fig. 1a and Supplementary Methods). Using this information, ExpressionBlast uniformly processes expression data sets in GEO across all experiments, species and platforms. This is achieved by standardizing the data in terms of gene identifiers, the meaning of the expression values (log ratios) and the distribution of these values (Fig. 1b and Supplementary Methods). Our processing steps achieved a high accuracy in identifying replicates and treatment control cases (Supplementary Results and Supplementary Table 3). We applied these processing steps to arrays from more than 900,000 individual samples collected from >40,000 studies in GEO (new series are updated on a weekly basis), which allowed us to create, to our knowledge, the largest collection of computationally annotated expression data currently available (Supplementary Results and Supplementary Table 4) (emphasis in original).

Now there is a letter to the editor!

Your first question:

How did the team create:

to our knowledge, the largest collection of computationally annotated expression data currently available….?

Hint: It wasn’t by creating a new naming system and then convincing the authors of > 40,000 studies to adopt a new naming system.

They achieved that result by:

This is achieved by standardizing the data in terms of gene identifiers, the meaning of the expression values (log ratios) and the distribution of these values (Fig. 1b and Supplementary Methods).
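
As a toy illustration of what that standardization step involves (not ExpressionBlast's actual pipeline), here is a hedged Python sketch: map platform-specific probe IDs onto common gene identifiers, convert treatment values to log ratios against controls, and only then merge across studies. The probe-to-gene table and numbers are invented.

```python
import math

# Hypothetical probe-to-gene mapping covering two different array platforms.
PROBE_TO_GENE = {
    "A_23_P100001": "TP53",
    "ILMN_1651228": "TP53",
    "A_23_P100011": "BRCA1",
    "ILMN_1651254": "BRCA1",
}

def standardize(samples, controls):
    """Collapse probe-level values to gene-level log2 ratios (treatment vs. control)."""
    by_gene = {}
    for probe, value in samples.items():
        gene = PROBE_TO_GENE.get(probe)
        if gene is None or probe not in controls:
            continue                       # skip probes we cannot map or compare
        by_gene[gene] = math.log2(value / controls[probe])
    return by_gene

study_a = standardize({"A_23_P100001": 420.0, "A_23_P100011": 95.0},
                      {"A_23_P100001": 210.0, "A_23_P100011": 100.0})
print(study_a)   # {'TP53': 1.0, 'BRCA1': -0.07...}
```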

The benefit from this work begins where “merging” in the topic map sense ends.

One point of curiosity, among many, is the interchangeability of their rule-based pattern expressions for merging replicates.

Even if the pattern expression language left execution up to the user, reliably exchanging mappings would be quite useful.

Perhaps a profile of an existing pattern expression language?

To avoid having to write one from scratch?
