## Archive for the ‘Filters’ Category

### squirro

Wednesday, March 13th, 2013

squirro

I am not sure how “hard” the numbers are, but this CRM application claims:

• Up to 15% increase in revenues
• 66% less time wasted on finding and re-finding information
• 15% increase in win rates

I take this as evidence there is a market for less noisy data streams.

If filtered search can produce this kind of ROI, imagine what curated search can do.

Yes?

### Crossfilter

Friday, March 8th, 2013

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.
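
The sorted-index trick is easy to demonstrate outside the browser. Here is a minimal Python sketch of the idea (my illustration, not Crossfilter’s code): a range filter becomes two binary searches, and nudging a filter bound only touches the records between the old and new bounds.

```python
import bisect

# Toy version of Crossfilter's core trick: keep a dimension's values in
# sorted order so a range filter is two binary searches, and adjusting a
# filter bound only touches the records between the old and new bounds.
payments = [12, 5, 30, 7, 22, 15, 3, 18]               # one "dimension"
order = sorted(range(len(payments)), key=lambda i: payments[i])
values = [payments[i] for i in order]                  # sorted copy

def filter_range(lo, hi):
    """Indexes of records with lo <= value < hi."""
    start = bisect.bisect_left(values, lo)
    stop = bisect.bisect_left(values, hi)
    return order[start:stop]

selected = set(filter_range(5, 20))       # initial filter: [5, 20)
# Widening the upper bound to 25 only scans records in [20, 25),
# an incremental update rather than a rescan of the whole dataset.
selected |= set(filter_range(20, 25))
print(sorted(payments[i] for i in selected))   # [5, 7, 12, 15, 18, 22]
```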

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

It will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

### A nice collaborative filtering tutorial “for dummies”

Tuesday, March 5th, 2013

A nice collaborative filtering tutorial “for dummies”

Danny Bickson writes:

I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called: A Programmer’s Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.

They are Chapter 2: Collaborative Filtering and Chapter 3: Implicit Ratings and Item Based Filtering.
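
If you want the flavor before opening the book, here is a minimal user-based sketch in Python (an illustration in the spirit of those chapters, not the book’s code — the data is made up):

```python
from math import sqrt

# Minimal user-based collaborative filtering: find the nearest neighbor
# and suggest items that neighbor rated which the user has not seen.
ratings = {
    "Ann":  {"Dylan": 4.0, "Adele": 3.5, "Prince": 5.0},
    "Ben":  {"Dylan": 4.5, "Adele": 3.0},
    "Cara": {"Adele": 5.0, "Prince": 2.0},
}

def distance(a, b):
    """Euclidean distance over the items two users have both rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return float("inf")
    return sqrt(sum((ratings[a][i] - ratings[b][i]) ** 2 for i in shared))

def recommend(user):
    """Items the nearest neighbor liked that `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = min(others, key=lambda u: distance(user, u))
    return sorted(
        (item, score) for item, score in ratings[nearest].items()
        if item not in ratings[user]
    )

print(recommend("Ben"))   # [('Prince', 5.0)] -- Ann is Ben's nearest neighbor
```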

### Collaborative Filtering via Group-Structured Dictionary Learning

Wednesday, January 30th, 2013

Collaborative Filtering via Group-Structured Dictionary Learning by Zoltan Szabo, Barnabas Poczos, and Andras Lorincz.

Abstract:

Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented method outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements.

From the paper:

Novel advances on CF show that dictionary learning based approaches can be efficient for making predictions about users’ preferences [2]. The dictionary learning based approach assumes that (i) there is a latent, unstructured feature space (hidden representation/code) behind the users’ ratings, and (ii) a rating of an item is equal to the product of the item and the user’s feature.
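
Read assumption (ii) as a plain inner product: if U holds user feature vectors and D holds item dictionary elements, a predicted rating is a dot product. A tiny sketch with made-up numbers (the paper’s contribution is the group structure it imposes on D, which this sketch omits):

```python
import numpy as np

# Assumption (ii) in miniature: a rating is the inner product of a user's
# latent feature vector and an item's dictionary element.
rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3          # k = size of the hidden code
U = rng.normal(size=(n_users, k))      # user feature vectors
D = rng.normal(size=(n_items, k))      # item dictionary elements

predicted = U @ D.T                    # users x items rating estimates
print(predicted[1, 4])                 # user 1's predicted rating of item 4
```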

Is a “preference” actually a form of subject identification?

I ask because the notion of a “real time” system is incompatible with users researching the proper canonical subject identifier and/or waiting for an inter-departmental committee to agree on correct terminology.

Perhaps subject identification in some systems must be on the basis of “…latent, unstructured feature space[s]…” that are known (and disclosed) imperfectly at best.

### SVDFeature: A Toolkit for Feature-based Collaborative Filtering

Thursday, January 17th, 2013

From the post:

SVDFeature: A Toolkit for Feature-based Collaborative Filtering by Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. The abstract reads:

In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years.

The wiki for the project and attendant code is here.
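
To see what “feature-based” buys you, here is a rough sketch of the model form (my reconstruction, not the toolkit’s API; every name and size below is made up). Users and items are described by sparse feature vectors, so side information is just extra features:

```python
import numpy as np

# Sketch of feature-based factorization in the SVDFeature spirit: the
# prediction combines feature biases with an inner product of summed
# user-feature and item-feature factors. Illustrative only.
k = 8                                         # latent dimension
n_user_feats, n_item_feats = 100, 50
P = np.random.randn(n_user_feats, k) * 0.01   # user-feature factors
Q = np.random.randn(n_item_feats, k) * 0.01   # item-feature factors
b_u = np.zeros(n_user_feats)                  # user-feature biases
b_i = np.zeros(n_item_feats)                  # item-feature biases
mu = 3.5                                      # global mean rating

def predict(user_feats, item_feats):
    """user_feats/item_feats: {feature_index: value} sparse vectors."""
    pu = sum(v * P[j] for j, v in user_feats.items())
    qi = sum(v * Q[j] for j, v in item_feats.items())
    bias = (sum(v * b_u[j] for j, v in user_feats.items())
            + sum(v * b_i[j] for j, v in item_feats.items()))
    return mu + bias + pu @ qi

# A user is feature 7; side information (say, their group) is feature 42.
print(predict({7: 1.0, 42: 1.0}, {3: 1.0}))
```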

Can’t argue with two KDD cups in as many years!

### Learning Mahout : Collaborative Filtering [Recommend Your Preferences?]

Friday, August 24th, 2012

Learning Mahout : Collaborative Filtering by Sujit Pal.

From the post:

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the (currently GA) 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique of predicting preferences given the preferences of others in the group. There are two main approaches – user based and item based. In case of user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In case of item-based recommendation, similarities between pairs of items are computed, then preferences predicted for the given user using a combination of the user’s current item preferences and the similarity matrix.
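
The item-based approach in that last paragraph fits in a few lines (toy data and my own code, not Mahout’s; Mahout expresses the same idea as multi-stage M/R jobs):

```python
import numpy as np

# Item-based prediction as described above: score an unseen item by
# combining the user's existing ratings with item-item similarities.
R = np.array([            # rows = users, cols = items, 0 = unrated
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 4.5],
    [1.0, 1.0, 5.0],
])

def item_similarity(a, b):
    """Cosine similarity between two item columns (co-rated rows only)."""
    mask = (R[:, a] > 0) & (R[:, b] > 0)
    if not mask.any():
        return 0.0
    x, y = R[mask, a], R[mask, b]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict(user, item):
    """Similarity-weighted average of the user's other ratings."""
    rated = [j for j in range(R.shape[1]) if R[user, j] > 0 and j != item]
    sims = np.array([item_similarity(item, j) for j in rated])
    if sims.sum() == 0:
        return 0.0
    return float(sims @ R[user, rated] / sims.sum())

print(predict(0, 2))   # predicted rating of item 2 for user 0
```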

While you are working your way through this post, keep in mind: Collaborative filtering with GraphChi.

Question: What if you are an outlier?

Telephone marketing interviews with me get shortened by responses like: “X? Is that a TV show?”

How would you go about piercing the marketing veil to recommend your preferences?

Now that is a product to which even I might subscribe. (But don’t advertise on TV, I won’t see it.)

### Mozilla Ignite [Challenge - $15,000]

Friday, June 15th, 2012

Mozilla Ignite

From the webpage:

Calling all developers, network engineers and community catalysts. Mozilla and the National Science Foundation (NSF) invite designers, developers and everyday people to brainstorm and build applications for the faster, smarter Internet of the future. The goal: create apps that take advantage of next-generation networks up to 250 times faster than today, in areas that benefit the public — like education, healthcare, transportation, manufacturing, public safety and clean energy.

Designing for the internet of the future

The challenge begins with a “Brainstorming Round” where anyone can submit and discuss ideas. The best ideas will receive funding and support to become a reality. Later rounds will focus specifically on application design and development. All are welcome to participate in the brainstorming round.

BRAINSTORM: What would you do with 1 Gbps? What apps would you create for deeply programmable networks 250x faster than today? Now through August 23rd, let’s brainstorm. $15,000 in prizes.

The challenge is focused specifically on creating public benefit in the U.S. The deadline for idea submissions is August 23, 2012.

Here is the entry website.

I assume the 1 Gbps is actual and not as measured by the marketing department of the local cable company.

That would have to be from a source that can push 1 Gbps to you, and you would have to be capable of handling it. (Upstream limitations are what choke my local speed down.)

I went looking for an example of what that would mean and came up with: “…[you] can download 23 episodes of 30 Rock in less than two minutes.”
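
A quick back-of-the-envelope check (my arithmetic, assuming roughly HD-quality episodes):

```python
# Back-of-the-envelope check on the "23 episodes in under two minutes" claim.
link_bytes_per_sec = 1e9 / 8          # 1 Gbps expressed in bytes/second
window = 2 * 60                       # two minutes, in seconds
total = link_bytes_per_sec * window   # ~15 GB moved in two minutes
print(f"{total / 23 / 1e6:.0f} MB per episode")  # ~652 MB: plausible for HD video
```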

On the whole, I would rather not.

What other uses would you suggest for 1 Gbps network speeds?

Assuming you have the capacity to push back at the same speed, I wonder what that means in terms of querying/viewing data as a topic map?

Transformation to a topic map for only a subset of data?

Looking forward to seeing your entries!

### Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Tuesday, June 5th, 2012

6th International Workshop on Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Paper Submission: June 21, 2012
Final Paper Submission and Registration: July 24, 2012

In conjunction with the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – IC3K 2012 – 04–07 October 2012 – Barcelona, Spain.

Scope

Nowadays users are more and more interested in information rather than in mere raw data. The huge amount of accessible data sources is growing rapidly. This calls for novel systems providing effective means of searching and retrieving information with the fundamental goal of making it exploitable by humans and machines.
DART focuses on researching and studying new challenges in distributed information filtering and retrieval. In particular, DART aims to investigate novel systems and tools for distributed scenarios and environments. DART will contribute to discussing and comparing suitable novel solutions based on intelligent techniques and applied in real-world applications.
Information Retrieval attempts to address similar filtering and ranking problems for pieces of information such as links, pages, and documents. Information Retrieval systems generally focus on the development of global retrieval techniques, often neglecting individual user needs and preferences.
Information Filtering has drastically changed the way information seekers find what they are searching for. In fact, they effectively prune large information spaces and help users in selecting items that best meet their needs, interests, preferences, and tastes. These systems rely strongly on the use of various machine learning tools and algorithms for learning how to rank items and predict user evaluation.

Topics of Interest

Topics of interest will include (but are not limited to):

• Web Information Filtering and Retrieval
• Web Personalization and Recommendation
• Web Agents
• Web of Data
• Semantic Web
• Semantics and Ontology Engineering
• Search for Social Networks and Social Media
• Natural Language and Information Retrieval in the Social Web
• Real-time Search
• Text categorization

If you are interested and have the time (or graduate students with the time), abstracts from prior conferences are here. It would be a useful exercise to search out publicly available copies. (As far as I can tell, there are no abstracts from DART.)

### How to Stay Current in Bioinformatics/Genomics [Role for Topic Maps as Filters?]

Wednesday, May 30th, 2012

How to Stay Current in Bioinformatics/Genomics by Stephen Turner.

From the post:

A few folks have asked me how I get my news and stay on top of what’s going on in my field, so I thought I’d share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what’s interesting, but filtering out what isn’t. What you don’t read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it’s essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I’m more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here’s how I stay current with things that are relevant to me. Please leave comments with what you’re reading and what you find useful that I omitted here.

Here is a concrete example of the information feeds used to stay current on bioinformatics/genomics.

A topic map mantra has been: “All the information about a subject in one place.”

Should that change to “current information about subject(s)…,” with topic maps as a filtering strategy rather than an aggregation strategy?

I think of filters as “subtractive,” but that is only one view of filtering.

We can have “additive” filters as well: a subtractive filter removes items that match (think spam rules), while an additive filter admits only items that match (think keyword alerts).

Take a look at the information feeds Stephen is using.

Would you use topic maps as “additive” or “subtractive” filters?

### Custom security filtering in Solr

Tuesday, April 3rd, 2012

Custom security filtering in Solr by Erik Hatcher

Yonik recently wrote about “Advanced Filter Caching in Solr” where he talked about expensive and custom filters; it was left as an exercise to the reader on the implementation details. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

Recap of Solr’s filtering and caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy scored, query (the infamous q parameter). As users navigate they will browse into facets. The search application generates filter query (fq) parameters for faceted navigation (eg. fq=color:red, as in the article referenced above). The filter queries are not involved in document scoring, serving only to reduce the search space. Solr sports a filter cache, caching the document sets of each unique filter query. These document sets are generated in advance, cached, and reduce the documents considered by the main query. Caching can be turned off on a per-filter basis; when filters are not cached, they are used in parallel to the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter in order to prioritize the leap-frogging (smallest set first would minimize documents being considered for matching).

Post filtering

Even without caching, filter sets default to generate in advance. In some cases it can be extremely expensive and prohibitive to generate a filter set. One example of this is with access control filtering that needs to take the user’s query context into account in order to know which documents are allowed to be returned or not. Ideally only matching documents, documents that match the query and straightforward filters, should be evaluated for security access control. It’s wasteful to evaluate any other documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works.
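
To ground the recap, here is roughly what those filter choices look like as request parameters (a sketch; the host, core, field names, and the acl_allow field are my inventions, and the cost >= 100 post-filter convention is the one Yonik’s article describes):

```python
import requests  # any HTTP client works; Solr is queried over HTTP

# Three flavors of filter query against a hypothetical core:
params = {
    "q": "ipod",                                   # scored main query
    "fq": [
        "color:red",                               # cached filter, computed up front
        "{!cache=false cost=50}store:MA",          # uncached, leap-frogs alongside q
        "{!cache=false cost=200}acl_allow:bob",    # cost >= 100: runs as a post filter
    ],
}
resp = requests.get("http://localhost:8983/solr/collection1/select", params=params)
print(resp.status_code)
```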

Good examples, but also heed the author’s warning to use the techniques in this article only when necessary. Sometimes simple solutions are best. Like using the network authentication layer to prevent unauthorized users from seeing the Solr application at all. No muss, no fuss.

### Tesseract – Fast Multidimensional Filtering for Coordinated Views

Sunday, March 25th, 2012

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.

### On the Power of HBase Filters

Thursday, March 8th, 2012

From the post:

Filters are a powerful feature of HBase to delegate the selection of rows to the servers rather than moving rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.

Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, shown to the user, is considered to be the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe and consistent storage and access.

The post goes on to observe:

When you deal with BigData, the data center is your computer.

True, but that isn’t the lesson I would draw from HBase Filters.

The lesson I would draw is: it is only big data until you can find the relevant data.

I may have to sift several haystacks of data, but at the end of the day I want the name, photo, location, target, and time frame for any particular evil-doer. That “big data” was part of the process is a fact, not a goal. Yes?
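
Concretely, “delegating selection to the servers” looks like this from a client. A hedged sketch using the happybase Thrift client; the host, table, and column names are made up:

```python
import happybase  # Thrift-based HBase client (assumed available)

# The filter string ships to the region servers, so only matching rows
# cross the network to the client -- the data locality principle above.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("events")

# HBase filter-language predicate, evaluated server-side:
flt = "SingleColumnValueFilter('d', 'status', =, 'binary:ERROR')"

for row_key, columns in table.scan(filter=flt, limit=10):
    print(row_key, columns.get(b"d:status"))
```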

### A Simple News Exploration Interface

Monday, November 14th, 2011

A Simple News Exploration Interface

Matthew Hurst writes:

I’ve just pushed out the next version of the hapax page. I’ve changed the interface to allow for dynamic filtering of the news stories presented. You can now type in filter terms (such as ‘bbc’ or ‘greece’) and the page will only display those stories that are related to those terms.

Very cool!

### New Challenges in Distributed Information Filtering and Retrieval

Sunday, September 11th, 2011

New Challenges in Distributed Information Filtering and Retrieval

Proceedings of the 5th International Workshop on New Challenges in Distributed Information Filtering and Retrieval
Palermo, Italy, September 17, 2011.

Edited by:

Cristian Lai – CRS4, Loc. Piscina Manna, Building 1 – 09010 Pula (CA), Italy

Giovanni Semeraro – Dept. of Computer Science, University of Bari, Aldo Moro, Via E. Orabona, 4, 70125 Bari, Italy

Eloisa Vargiu – Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

1. Experimenting Text Summarization on Multimodal Aggregation
Giuliano Armano, Alessandro Giuliani, Alberto Messina, Maurizio Montagnuolo, Eloisa Vargiu
2. From Tags to Emotions: Ontology-driven Sentimental Analysis in the Social Semantic Web
Matteo Baldoni, Cristina Baroglio, Viviana Patti, Paolo Rena
3. A Multi-Agent Decision Support System for Dynamic Supply Chain Organization
Luca Greco, Liliana Lo Presti, Agnese Augello, Giuseppe Lo Re, Marco La Cascia, Salvatore Gaglio
4. A Formalism for Temporal Annotation and Reasoning of Complex Events in Natural Language
Francesco Mele, Antonio Sorgente
5. Interaction Mining: the new Frontier of Call Center Analytics
Vincenzo Pallotta, Rodolfo Delmonte, Lammert Vrieling, David Walker
6. Context-Aware Recommender Systems: A Comparison Of Three Approaches
Umberto Panniello, Michele Gorgoglione
7. A Multi-Agent System for Information Semantic Sharing
Agostino Poggi, Michele Tomaiuolo
8. Temporal characterization of the requests to Wikipedia
Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Rocio Muñoz-Mansilla, Israel Herraiz
9. From Logical Forms to SPARQL Query with GETARUN
Rocco Tripodi, Rodolfo Delmonte
10. ImageHunter: a Novel Tool for Relevance Feedback in Content Based Image Retrieval
Roberto Tronci, Gabriele Murgia, Maurizio Pili, Luca Piras, Giorgio Giacinto

### OrganiK Knowledge Management System

Monday, July 4th, 2011

OrganiK Knowledge Management System (wiki)

OrganiK Knowledge Management System (homepage)

I encountered the OrganiK project while searching for something else (naturally).

From the homepage:

Objectives of the Project

The aim of the OrganiK project is to research and develop an innovative knowledge management system that enables the semantic fusion of enterprise social software applications. The system accumulates information that can be exchanged among one or several collaborating companies. This enables an effective management of organisational knowledge and can be adapted to functional requirements of smaller and knowledge-intensive companies.

Main distinguishing features

The set of OrganiK KM Client Interfaces comprises a Wiki, a Blog, a Social Bookmarking and a Search Component that together constitute a Collaborative Workspace for SME knowledge workers. Each of the components consists of a Web-based client interface and a corresponding server engine.
The components that comprise the Business Logic Layer of the OrganiK KM Server are:

• the Recommender System
• the Semantic Text Analyser
• the Collaborative Filtering Engine
• the Full-text Indexer

Interesting project but the latest news item dates from 2008. Not encouraging.

I checked the source code and the most recent update was August, 2010. Much more encouraging.

I have written to ask for more recent news.

### SwiftRiver/Ushahidi

Sunday, July 3rd, 2011

SwiftRiver

From the Get Started page:

The mission of the SwiftRiver initiative is to democratize access to the tools used to make sense of data.

To achieve this goal we’ve taken two approaches, apps and APIs. Apps are user facing and should be tools that are easy to understand, deploy and use. APIs are machine facing and extract meta-context that other machines (apps) use to convey information to the end user.

SwiftRiver is an opensource platform that aims to allow users to do three things well: 1) structure unstructured data feeds, 2) filter and prioritize information conditionally and 3) add context to content. Doing these things well allows users to pull in real-time content from Twitter, SMS, Email or the Web and to make sense of data on the fly.

The Ushahidi logo at the top will take you to a common wiki for Ushahidi and SwiftRiver.

And the Ushahidi link in the text takes you to Ushahidi:

We are a non-profit tech company that develops free and open source software for information collection, visualization and interactive mapping.

Home of:

• Ushahidi Platform: We built the Ushahidi platform as a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.
• SwiftRiver: SwiftRiver is an open source platform that aims to democratize access to tools for filtering & making sense of real-time information.
• Crowdmap: When you need to get the Ushahidi platform up in 2 minutes to crowdsource information, Crowdmap will do it for you. It’s our hosted version of the Ushahidi platform.

It occurs to me that mapping email feeds would fit right into my example in Marketing What Users Want…And An Example.

### Big Data Could Be Big Pain Without Semantic Search To Help Filter It

Thursday, June 23rd, 2011

Big Data Could Be Big Pain Without Semantic Search To Help Filter It

From the post:

Search Explore Engine leverages the core of its Cogito Focus technology to provide multiple ways to filter data with the help of semantic tagging and categorization. But it also includes a new interface that Scagliarini says makes it more accessible to less advanced users for intuitive, visual navigation of tags and facets, as well as interaction with search results to discover new connections and data.

One feature, the treemap graphic, summarizes information included in a search data stream by representing each topic in a different color, using the size of squares to indicate the frequency of similar documents, and using shades of color to distinguish recent news from older events.

“A big chunk of the innovation in Search Explore Engine is really to make it simple to integrate information,” he says. As an example, it provides an out-of-the-box geographic taxonomy for identifying specific geographic areas referenced in the dynamic information stream or licensed data streams, and enabling users to create ways to access that information using integration with maps. “So they can create an area of interest [on a map] and retrieve information mainly about that area. Or there’s the possibility to give a visualization of entity maps – all the entities included in a set of documents that you select have a visual representation that shows which kinds of entities are related to which kind of other entities, so you can use the map to filter down and identify your search criteria or your search intent,” he says.

The solution is initially targeted at advanced knowledge workers but Scagliarini says that the user base will expand pretty quickly. “This level of sophistication is done by the business analyst or the marketing managers or those dealing with extracting knowledge,” who will prepackage and distribute the information inside the organization, he says, “but we think progressively this need is broader in the organization. If you don’t have any kind of ways to filter more effectively all the information you have access to, you are already at a disadvantage and that can get only worse.”

I am torn between the two lines:

there’s the possibility to give a visualization of entity maps – all the entities included in a set of documents that you select have a visual representation that shows which kinds of entities are related to which kind of other entities, so you can use the map to filter down and identify your search criteria or your search intent (emphasis added)

or

If you don’t have any kind of ways to filter more effectively all the information you have access to, you are already at a disadvantage and that can get only worse. (emphasis added)

as to which one I like better.

The one on “entity maps” is talking about topic maps without using the term and the one about filtering captures one aspect of the modern information dilemma that topic maps can solve.

Which one do you like better?

### The Science and Magic of User and Expert Feedback for Improving Recommendations

Friday, May 27th, 2011

The Science and Magic of User and Expert Feedback for Improving Recommendations by Dr. Xavier Amatriain (Telefonica).

Abstract:

Recommender systems are playing a key role in the next web revolution as a practical alternative to traditional search for information access and filtering. Most of these systems use Collaborative Filtering techniques in which predictions are solely based on the feedback of the user and similar peers. Although this approach is considered relatively effective, it has reached some practical limitations such as the so-called Magic Barrier. Many of these limitations stem from the fact that explicit user feedback in the form of ratings is considered the ground truth. However, this feedback has a non-negligible amount of noise and inconsistencies. Furthermore, in most practical applications, we lack enough explicit feedback and would be better off using implicit feedback or usage data.

In the first part of my talk, I will present our studies in analyzing natural noise in explicit feedback and finding ways to overcome it to improve recommendation accuracy. I will also present our study of user implicit feedback and an approach to relate both kinds of information. In the second part, I will introduce a radically different approach to recommendation that is based on the use of the opinions of experts instead of regular peers. I will show how this approach addresses many of the shortcomings of traditional Collaborative Filtering, generates recommendations that are better perceived by the users, and allows for new applications such as fully-privacy preserving recommendations.

Chris Anderson: “We are leaving the age of information and entering the age of recommendation.”

I suspect Chris Anderson must not be an active library user. Librarians were making recommendations to researchers, patrons, and children doing homework long before recommender systems existed. I would say we are returning to the age of librarians, assisted by recommender systems.

Librarians use the reference interview so that, based on feedback from patrons, they can make appropriate recommendations.

If you substitute librarian for “expert” in this presentation, it becomes apparent the world of information is coming back around to libraries and librarians.

Librarians should be making the case, both in the literature and to researchers like Dr. Amatriain, that librarians can play a vital role in recommender systems.

This is a very enjoyable as well as useful presentation.

For further information see:

http://xavier.amatriain.net

http://technocalifornia.blogspot.net

### The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity

Monday, May 16th, 2011

The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity

Covers the same TED presentation that I mention in On the dangers of personalization, but with the value-add that Maria both interviews Eli Pariser and talks about his new book, The Filter Bubble.

I remain untroubled by filtering.

We filter the information we give others around us.

Advertisers filter the information they present in commercials.

For example, I don’t recall any Toyota ads that end with: Buy a Toyota ****, your odds of being in a recall are 1 in ***. That’s filtering.

Two things would increase my appreciation for Google filtering:

First, much better filtering, where I can choose narrow-band filter(s) based on my interests.

Second, the ability to turn the filters off at my option.

You see, I don’t agree that there is information I need to know as determined by someone else.

Here’s an interesting question: What information would you filter from www.cnn.com?

### On the dangers of personalization

Saturday, May 7th, 2011

On the dangers of personalization

From the post:

We’re getting our search results seriously edited and, I bet, most of us don’t even know it. I didn’t. One Google engineer says that their search engine uses 57 signals to personalize your search results, even when you’re logged out.

Do we really want to live in a web bubble?

What I find interesting about this piece is that it describes a data silo but from the perspective of an individual.

A data silo is based on data that is filtered and stored.

Personalization is based on data that is filtered and presented.

Do you see any difference?