Archive for the ‘Filters’ Category

AI Assisted Filtering?

Thursday, February 23rd, 2017

Check Out Alphabet’s New Tool to Weed Out the ‘Toxic’ Abuse of Online Comments by Jeff John Roberts.

From the post:

A research team tied to Google unveiled a new tool on Thursday that could have a profound effect on how we talk to each other online. It’s called “Perspective,” and it provides a way for news websites and blogs to moderate online discussions with the help of artificial intelligence.

The researchers believe it could turn the tide against trolls on the Internet, and reestablish online comment forums—which many view as cesspools of hatred and stupidity—as a place for honest debate about current events.

The Perspective tool was hatched by artificial intelligence experts at Jigsaw, a subsidiary of Google-holding company Alphabet (GOOGL, -0.04%) that is devoted to policy and ideas. The significance of the tool, pictured below, is that it can decide if an online comment is “toxic” without the aid of human moderators. This means websites—many of which have given up on hosting comments altogether—could now have an affordable way to let their readers debate contentious topics of the day in a civil and respectful forum.

“Imagine trying to have a conversation with your friends about the news you read this morning, but every time you said something, someone shouted in your face, called you a nasty name or accused you of some awful crime,” Jigsaw founder and president Jared Cohen said in a blog post. “We think technology can help.”

I’m intrigued by this, at least to the extent that AI assisted filtering is extended to users. Such that a user can determine what comments they do/don’t see.

I avoid all manner of nonsense on the Internet, in part by there being places I simply don’t go. Not worth the effort to filter all the trash.

But at the same time, I don’t prevent other people, who may have differing definitions of “trash,” from consuming as much of it as they desire.

It’s really sad that Twitter continues to ignore the market potential of filters in favor of its mad-cap pursuit of being an Internet censor.

I have even added Ed Ho, said to be the VP of Engineering at Twitter, to one or more of my tweets suggesting ways Twitter could make money on filters. No response, nada.

It’s either “not invented here,” or Twitter staff spend so much time basking in their own righteousness they can’t be bothered with communications from venal creatures. Hard to say.

Jeff reports this is a work in progress and you can see it for yourself: What if technology could help improve conversations online?

Check out the code at:

Or even Request API Access! (There is no separate link, try:

Perspective can help with your authoring in real time.

Try setting the sensitivity very low and write/edit until it finally objects. 😉

Especially for Fox News comments. I always leave some profanity or ill comment unsaid. Maybe Perspective can help with that.

Twitter Said to Work on Anti-Harassment Keyword Filtering Tool [Good News!]

Sunday, August 28th, 2016

Twitter Said to Work on Anti-Harassment Keyword Filtering Tool by Sarah Frier.

From the post:

Twitter Inc. is working on a keyword-based tool that will let people filter the posts they see, giving users a more effective way to block out harassing and offensive tweets, according to people familiar with the matter.

The San Francisco-based company has been discussing how to implement the tool for about a year as it seeks to stem abuse on the site, said the people, who asked not to be identified because the initiative isn’t public. By using keywords, users could block swear words or racial slurs, for example, to screen out offenders.

Nice to have good news to report about Twitter!

Suggestions before the code gets set in stone:

  • Enable users to “follow” filters of other users
  • Enable filters to filter on nicknames in content and as sender
  • Regexes anyone?
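To make the three suggestions concrete, here is a minimal sketch of user-owned filters that match on regexes, block by sender nickname, and can be "followed" by other users. Nothing here is Twitter's design: the `TweetFilter` class, the `visible` function, and the sample timeline are all hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TweetFilter:
    """A user-owned filter: regex patterns for text plus blocked sender names."""
    owner: str
    patterns: list = field(default_factory=list)        # regex strings
    blocked_senders: set = field(default_factory=set)   # lowercase nicknames

    def hides(self, sender, text):
        """Return True if this filter would hide the tweet."""
        if sender.lower() in self.blocked_senders:
            return True
        return any(re.search(p, text, re.IGNORECASE) for p in self.patterns)


def visible(tweets, own_filter, followed_filters=()):
    """Keep only tweets that neither my filter nor any followed filter hides."""
    filters = [own_filter, *followed_filters]
    return [
        (sender, text)
        for sender, text in tweets
        if not any(f.hides(sender, text) for f in filters)
    ]


# Hypothetical filters and timeline, made up for this example.
mine = TweetFilter("me", patterns=[r"\bspam+\b"])
friend = TweetFilter("friend", patterns=[r"@troll\w*"], blocked_senders={"shouty"})

timeline = [
    ("alice", "Topic maps and filters, a short note"),
    ("bob", "Buy now!!! spammm"),
    ("shouty", "HELLO"),
    ("carol", "cc @troll42 what do you think?"),
]

kept = visible(timeline, mine, followed_filters=[friend])
```

Only the tweet from alice survives: my own regex hides bob, and the followed filter hides both the blocked sender and the nickname mentioned in content.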

A big step towards empowering users!

Technology Adoption – Nearly A Vertical Line (To A Diminished IQ)

Thursday, March 10th, 2016


From: There’s a major long-term trend in the economy that isn’t getting enough attention by Rick Rieder.

From the post:

As the chart above shows, people in the U.S. today are adopting new technologies, including tablets and smartphones, at the swiftest pace we’ve seen since the advent of the television. However, while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs. Indeed, when you take into account technology’s downward influence on price, U.S. consumption and productivity figures look much better than headline numbers would suggest.

Hmmm, did you catch that?

…while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs.

Really? Rick must have missed the memo on how multitasking (one aspect of smart phones, tablets, etc.) lowers your IQ by 15 points. About what you would expect from smoking a joint.

If technology encourages multitasking, making us dumber, then we are becoming less efficient. Yes?

Imagine if instead of scrolling past tweets with images of cats, food, irrelevant messages, every time you look at your Twitter time line, you got the two or three tweets relevant to your job function.

Each of those not-on-task tweets chips away at the amount of attention span you have to spend on the two or three truly important tweets.

Apps that consolidate, filter and diminish information flow are the path to greater productivity.

Topic maps anyone?

Filter [Impersonating You]

Friday, October 9th, 2015


From the webpage:

Filter shows you the top stories from communities of Twitter users across a range of topics like climate change, bitcoin, and U.S. foreign policy.

With Filter, the only way you’ll miss something is if the entire community misses it too.

Following entire Twitter communities is a good idea but signing in with Twitter enables Filter to impersonate you.

This application will be able to:

  • Read Tweets from your timeline.
  • See who you follow, and follow new people.
  • Update your profile.
  • Post Tweets for you.

(emphasis added)

My complaint is general to all Sign in with Twitter applications and Filter is just an example I encountered this morning.

I can’t explore and report to you the features or shortcomings of Filter because I am happy with my current following list and have no desire to allow some unknown (read untrusted) third-party posting on my Twitter account.

If you encounter a review of Filter by someone who isn’t bothered by being randomly impersonated, send me a link. I would like to know more about the site.


Kalman and Bayesian Filters in Python

Monday, March 9th, 2015

Kalman and Bayesian Filters in Python by Roger Labbe.

Apologies for the lengthy quote but Roger makes a great case for interactive textbooks, IPython notebooks, writing for the reader as opposed to making the author feel clever, and finally, making content freely available.

It is a quote that I am going to make a point to read on a regular basis.

And all of that before turning to the subject at hand!


From the preface:

This is a book for programmers that have a need or interest in Kalman filtering. The motivation for this book came out of my desire for a gentle introduction to Kalman filtering. I’m a software engineer that spent almost two decades in the avionics field, and so I have always been ‘bumping elbows’ with the Kalman filter, but never implemented one myself. They have always had a fearsome reputation for difficulty, and I did not have the requisite education. Everyone I met that did implement them had multiple graduate courses on the topic and extensive industrial experience with them. As I moved into solving tracking problems with computer vision the need to implement them myself became urgent. There are classic textbooks in the field, such as Grewal and Andrews’ excellent Kalman Filtering. But sitting down and trying to read many of these books is a dismal and trying experience if you do not have the background. Typically the first few chapters fly through several years of undergraduate math, blithely referring you to textbooks on, for example, Itō calculus, and presenting an entire semester’s worth of statistics in a few brief paragraphs. These books are good textbooks for an upper undergraduate course, and an invaluable reference to researchers and professionals, but the going is truly difficult for the more casual reader. Symbology is introduced without explanation, different texts use different words and variable names for the same concept, and the books are almost devoid of examples or worked problems. I often found myself able to parse the words and comprehend the mathematics of a definition, but had no idea as to what real world phenomena these words and math were attempting to describe. “But what does that mean?” was my repeated thought.

However, as I began to finally understand the Kalman filter I realized the underlying concepts are quite straightforward. A few simple probability rules, some intuition about how we integrate disparate knowledge to explain events in our everyday life and the core concepts of the Kalman filter are accessible. Kalman filters have a reputation for difficulty, but shorn of much of the formal terminology the beauty of the subject and of their math became clear to me, and I fell in love with the topic.

As I began to understand the math and theory more difficulties presented themselves. A book or paper’s author makes some statement of fact and presents a graph as proof. Unfortunately, why the statement is true is not clear to me, nor is the method by which you might make that plot obvious. Or maybe I wonder “is this true if R=0?” Or the author provides pseudocode – at such a high level that the implementation is not obvious. Some books offer Matlab code, but I do not have a license to that expensive package. Finally, many books end each chapter with many useful exercises. Exercises which you need to understand if you want to implement Kalman filters for yourself, but exercises with no answers. If you are using the book in a classroom, perhaps this is okay, but it is terrible for the independent reader. I loathe that an author withholds information from me, presumably to avoid ‘cheating’ by the student in the classroom.

None of this is necessary, from my point of view. Certainly if you are designing a Kalman filter for an aircraft or missile you must thoroughly master all of the mathematics and topics in a typical Kalman filter textbook. I just want to track an image on a screen, or write some code for my Arduino project. I want to know how the plots in the book are made, and choose different parameters than the author chose. I want to run simulations. I want to inject more noise in the signal and see how a filter performs. There are thousands of opportunities for using Kalman filters in everyday code, and yet this fairly straightforward topic is the province of rocket scientists and academics.

I wrote this book to address all of those needs. This is not the book for you if you program avionics for Boeing or design radars for Raytheon. Go get a degree at Georgia Tech, UW, or the like, because you’ll need it. This book is for the hobbyist, the curious, and the working engineer that needs to filter or smooth data.

This book is interactive. While you can read it online as static content, I urge you to use it as intended. It is written using IPython Notebook, which allows me to combine text, python, and python output in one place. Every plot, every piece of data in this book is generated from Python that is available to you right inside the notebook. Want to double the value of a parameter? Click on the Python cell, change the parameter’s value, and click ‘Run’. A new plot or printed output will appear in the book.

This book has exercises, but it also has the answers. I trust you. If you just need an answer, go ahead and read the answer. If you want to internalize this knowledge, try to implement the exercise before you read the answer.

This book has supporting libraries for computing statistics, plotting various things related to filters, and for the various filters that we cover. This does require a strong caveat; most of the code is written for didactic purposes. It is rare that I chose the most efficient solution (which often obscures the intent of the code), and in the first parts of the book I did not concern myself with numerical stability. This is important to understand – Kalman filters in aircraft are carefully designed and implemented to be numerically stable; the naive implementation is not stable in many cases. If you are serious about Kalman filters this book will not be the last book you need. My intention is to introduce you to the concepts and mathematics, and to get you to the point where the textbooks are approachable.

Finally, this book is free. The cost for the books required to learn Kalman filtering is somewhat prohibitive even for a Silicon Valley engineer like myself; I cannot believe they are within the reach of someone in a depressed economy, or a financially struggling student. I have gained so much from free software like Python, and free books like those from Allen B. Downey here [1]. It’s time to repay that. So, the book is free, it is hosted on free servers, and it uses only free and open software such as IPython and mathjax to create the book.

I first saw this in a tweet by nixCraft.
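For a taste of the "change a parameter and rerun" style Roger describes, here is a minimal scalar Kalman filter that estimates a constant value from noisy readings. It is a sketch under my own assumptions, not code from the book: the function name `kalman_1d`, the noise levels, and the starting values are all invented for illustration, and no attention is paid to numerical stability.

```python
import random


def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1000.0):
    """Scalar Kalman filter for a (nearly) constant value.

    x is the state estimate and p its variance. Returns the list of estimates.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the model says "the value stays put", so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain k.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical experiment: a true value of 10.0 read through noise (std. dev. 2.0).
random.seed(1)
true_value = 10.0
noisy = [true_value + random.gauss(0.0, 2.0) for _ in range(200)]
est = kalman_1d(noisy, meas_var=4.0, process_var=1e-5)
```

Raise the standard deviation in `random.gauss` to inject more noise, or change `meas_var` to tell the filter how much to trust each reading, and compare how quickly `est` settles near 10.0.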

A Latent Source Model for Online Collaborative Filtering

Wednesday, December 10th, 2014

A Latent Source Model for Online Collaborative Filtering by Guy Bresler, George H. Chen, and Devavrat Shah.


Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the “online” setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of n users either likes or dislikes each of m items. We assume there to be k types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly log(km) initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing k. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).

The similarity between users makes me wonder if merging results from a topic map could or should be returned on the basis of a similarity of users? On the assumption that, at some degree of similarity, distinct users share views about subject identity.
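The exploitation step in the abstract can be sketched in a few lines: score users by cosine similarity and recommend an item the target user has not yet consumed. This is only a toy illustration, not the authors' algorithm. It omits both exploration steps, and the `threshold` value, function names, and ratings table are assumptions of mine.

```python
import math


def cosine(u, v):
    """Cosine similarity of two rating vectors (+1 like, -1 dislike, 0 unseen)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(ratings, user, threshold=0.2):
    """Pick the unseen item with the highest vote among sufficiently similar users."""
    me = ratings[user]
    neighbors = [
        other for name, other in ratings.items()
        if name != user and cosine(me, other) >= threshold
    ]
    best_item, best_score = None, float("-inf")
    for item, seen in enumerate(me):
        if seen != 0:
            continue  # an item already consumed is never recommended again
        score = sum(n[item] for n in neighbors)
        if score > best_score:
            best_item, best_score = item, score
    return best_item


# Hypothetical like/dislike table: rows are users, columns are items 0..4.
ratings = {
    "ann": [1, 1, -1, 0, 0],
    "ben": [1, 1, -1, 1, -1],
    "cat": [1, 1, 0, 1, 0],
    "dan": [-1, -1, 1, -1, 1],
}
pick = recommend(ratings, "ann")
```

Here ben and cat count as similar to ann while dan does not, so item 3 (liked by both neighbors) is recommended over item 4.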

Consensus Filters

Wednesday, October 22nd, 2014

Consensus Filters by Yao Yujian.

From the post:

Suppose you have a huge number of robots/vehicles and you want all of them to track some global value, maybe the average of the weight of the fuel that each contains.

One way to do this is to have a master server that takes in everyone’s input and generates the output. So others can get it from the master. But this approach results in a single point of failure and a huge traffic to one server.

The other way is to let all robots talk to each other, so each robot will have information from others, which can then be used to compute the sum. Obviously this will incur a huge communication overhead. Especially if we need to generate the value frequently.

If we can tolerate approximate results, we have a third approach: consensus filters.

There are two advantages to consensus filters:

  1. Low communication overhead
  2. Approximate values can be used even without a consensus

Approximate results won’t be acceptable for all applications but where they are, consensus filters may be on your agenda.
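A rough sketch of the idea, assuming the simplest form of average consensus: each robot repeatedly nudges its value toward the values of its immediate neighbors, and every robot drifts toward the global average without a master server. The ring topology, the step size `epsilon`, and the fuel weights are assumptions for illustration and are not taken from Yao's post.

```python
def consensus_step(values, neighbors, epsilon=0.3):
    """One synchronous round: each node moves toward its neighbors' values."""
    return [
        x + epsilon * sum(values[j] - x for j in neighbors[i])
        for i, x in enumerate(values)
    ]


# Hypothetical fuel weights for six robots arranged in a ring.
fuel = [10.0, 40.0, 25.0, 5.0, 30.0, 10.0]
n = len(fuel)
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

values = list(fuel)
for _ in range(100):
    values = consensus_step(values, ring)

average = sum(fuel) / n  # the value every robot should approach
```

Each robot talks to only two neighbors per round, and the intermediate `values` are usable approximations long before the rounds finish, which matches the two advantages listed above.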

Bloom Filters

Wednesday, October 15th, 2014

Bloom Filters by Jason Davies.

From the post:

Everyone is always raving about bloom filters. But what exactly are they, and what are they useful for?

Very straightforward explanation along with interactive demo. The applications section will immediately suggest how Bloom filters could be used when querying.

There are other complexities, see the Bloom Filter entry at Wikipedia. But as a first blush explanation, you will be hard pressed to find one as good as Jason’s.

I first saw this in a tweet by Allen Day.
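If you want something to poke at alongside Jason's demo, here is a minimal Bloom filter sketch. The sizes, the choice of SHA-256, and the salting scheme are my own assumptions, picked for readability and not for speed.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash functions; no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for word in ("kalman", "bloom", "consensus"):
    seen.add(word)
```

A membership test such as `"bloom" in seen` is always true for added items. For anything else it is usually false, with a small chance of a false positive that grows as the bit array fills up.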

Filtering: Seven Principles

Tuesday, January 7th, 2014

Filtering: Seven Principles by JP Rangaswami.

When you read “filters” in the seven rules, think merging rules.

From the post:

  1. Filters should be built such that they are selectable by subscriber, not publisher.
  2. Filters should intrinsically be dynamic, not static.
  3. Filters should have inbuilt “serendipity” functionality.
  4. Filters should be interchangeable, exchangeable, even tradeable.
  5. The principal filters should be by choosing a variable and a value (or range of values) to include or exclude.
  6. Secondary filters should then be about routing.
  7. Network-based filters, “collaborative filtering” should then complete the set.

Nat Torkington comments on this list:

I think the basic is: 0: Customers should be able to run their own filters across the information you’re showing them.


And it should be simpler than hunting for .config/google-chrome/Default/User Stylesheets/Custom.css (for Chrome on Ubuntu).

Ideally a select (from a webpage) and choose an action.

The ability to dynamically select properties for merging would greatly enhance a user’s ability to explore and mine a topic map.

I first saw this in Nat Torkington’s Four short links: 6 January 2014.


Wednesday, March 13th, 2013


I am not sure how “hard” the numbers are but CRM application claims:

Up to 15% increase in revenues

66% less time wasted on finding and re-finding information

15% increase in win rates

I take this as evidence there is a market for less noisy data streams.

If filtered search can produce this kind of ROI, imagine what curated search can do.



Friday, March 8th, 2013

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.
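The sorted-index idea is easy to sketch outside the browser. The Python below is not Crossfilter's API or implementation, just an illustration under my own assumptions: keep each dimension sorted once, and a range filter becomes two binary searches. The `Dimension` class and the payment records are hypothetical, and the incremental update and bit-twiddling parts are left out.

```python
from bisect import bisect_left


class Dimension:
    """Sorted index over one field; a range filter becomes two binary searches."""

    def __init__(self, records, key):
        # Record positions ordered by the dimension's value.
        self.order = sorted(range(len(records)), key=lambda i: key(records[i]))
        self.values = [key(records[i]) for i in self.order]

    def filter_range(self, lo, hi):
        """Return positions of records with lo <= value < hi."""
        start = bisect_left(self.values, lo)
        stop = bisect_left(self.values, hi)
        return self.order[start:stop]


# Hypothetical payment records.
payments = [
    {"total": 12, "tip": 2},
    {"total": 90, "tip": 0},
    {"total": 45, "tip": 5},
    {"total": 30, "tip": 3},
]
by_total = Dimension(payments, key=lambda r: r["total"])
hits = sorted(by_total.filter_range(20, 60))
```

Nudging a filter bound only moves `start` or `stop` a little along the sorted list, which gives some intuition for why small adjustments are cheap compared with rescanning every record.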

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

Will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

A nice collaborative filtering tutorial “for dummies”

Tuesday, March 5th, 2013

A nice collaborative filtering tutorial “for dummies”

Danny Bickson writes:

I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called: A Programmer’s Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.

They are Chapter 2: Collaborative Filtering and Chapter 3: Implicit Ratings and Item Based Filtering.

Collaborative Filtering via Group-Structured Dictionary Learning

Wednesday, January 30th, 2013

Collaborative Filtering via Group-Structured Dictionary Learning by Zoltan Szabo, Barnabas Poczos, and Andras Lorincz.


Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented method outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements.

From the paper:

Novel advances on CF show that dictionary learning based approaches can be efficient for making predictions about users’ preferences [2]. The dictionary learning based approach assumes that (i) there is a latent, unstructured feature space (hidden representation/code) behind the users’ ratings, and (ii) a rating of an item is equal to the product of the item and the user’s feature.
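Assumption (ii) in the passage above amounts to an inner product. As a minimal sketch, with hypothetical two-dimensional vectors that are not taken from the paper:

```python
def predict_rating(item_vector, user_feature):
    """Predicted rating as the inner product of item and user vectors."""
    if len(item_vector) != len(user_feature):
        raise ValueError("vectors must have the same length")
    return sum(d * a for d, a in zip(item_vector, user_feature))


# Hypothetical two-dimensional latent space; values chosen for illustration.
item = [0.5, 2.0]
user = [4.0, 1.0]
rating = predict_rating(item, user)  # 0.5 * 4.0 + 2.0 * 1.0
```

Neither coordinate of the latent space has to mean anything a user could name, which is what makes the question below interesting.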

Is a “preference” actually a form of subject identification?

I ask because the notion of a “real time” system is incompatible with users researching the proper canonical subject identifier and/or waiting for a response from an inter-departmental committee to agree on correct terminology.

Perhaps subject identification in some systems must be on the basis of “…latent, unstructured feature space[s]…” that are known (and disclosed) imperfectly at best.

Zoltán Szabó’s Home Page, numerous publications and the source code for this article.

SVDFeature: A Toolkit for Feature-based Collaborative Filtering

Thursday, January 17th, 2013

SVDFeature: A Toolkit for Feature-based Collaborative Filtering – implementation by Igor Carron.

From the post:

SVDFeature: A Toolkit for Feature-based Collaborative Filtering by Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, Yong Yu. The abstract reads:

In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years.

The wiki for the project and attendant code is here.

Can’t argue with two KDD cups in as many years!

Licensed under Apache 2.0.

Learning Mahout : Collaborative Filtering [Recommend Your Preferences?]

Friday, August 24th, 2012

Learning Mahout : Collaborative Filtering by Sujit Pal.

From the post:

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the current (GA) 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique of predicting preferences given the preferences of others in the group. There are two main approaches – user based and item based. In case of user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In case of item-based recommendation, similarities between pairs of items are computed, then preferences predicted for the given user using a combination of the user’s current item preferences and the similarity matrix.

While you are working your way through this post, keep in mind: Collaborative filtering with GraphChi.
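To see the item-based approach Sujit describes without installing Mahout, here is a small pure-Python sketch. It does not use Mahout's API. The ratings table and function names are hypothetical, and cosine similarity is just one of several similarity measures that could be plugged in.

```python
import math

# Hypothetical ratings (user -> {item: rating}); not data from the post.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 5, "C": 1},  # has not rated B
}


def item_vector(item):
    """Ratings for one item, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}


def item_similarity(a, b):
    """Cosine similarity over the users who rated both items."""
    va, vb = item_vector(a), item_vector(b)
    common = va.keys() & vb.keys()
    if not common:
        return 0.0
    dot = sum(va[u] * vb[u] for u in common)
    na = math.sqrt(sum(va[u] ** 2 for u in common))
    nb = math.sqrt(sum(vb[u] ** 2 for u in common))
    return dot / (na * nb)


def predict(user, target):
    """Similarity-weighted average of the user's own ratings of other items."""
    rated = ratings[user]
    weights = {i: item_similarity(target, i) for i in rated if i != target}
    total = sum(weights.values())
    if total == 0:
        return None
    return sum(weights[i] * rated[i] for i in weights) / total


guess = predict("u4", "B")
```

Item B is rated much like item A and unlike item C, so the prediction for u4 leans toward u4's high rating of A.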

Question: What if you are an outlier?

Telephone marketing interviews with me get shortened by responses like: “X? Is that a TV show?”

How would you go about piercing the marketing veil to recommend your preferences?

Now that is a product to which even I might subscribe. (But don’t advertise on TV, I won’t see it.)

Mozilla Ignite [Challenge – $15,000]

Friday, June 15th, 2012

Mozilla Ignite

From the webpage:

Calling all developers, network engineers and community catalysts. Mozilla and the National Science Foundation (NSF) invite designers, developers and everyday people to brainstorm and build applications for the faster, smarter Internet of the future. The goal: create apps that take advantage of next-generation networks up to 250 times faster than today, in areas that benefit the public — like education, healthcare, transportation, manufacturing, public safety and clean energy.

Designing for the internet of the future

The challenge begins with a “Brainstorming Round” where anyone can submit and discuss ideas. The best ideas will receive funding and support to become a reality. Later rounds will focus specifically on application design and development. All are welcome to participate in the brainstorming round.


What would you do with 1 Gbps? What apps would you create for deeply programmable networks 250x faster than today? Now through August 23rd, let’s brainstorm. $15,000 in prizes.

The challenge is focused specifically on creating public benefit in the U.S. The deadline for idea submissions is August 23, 2012.

Here is the entry website.

I assume the 1Gbps is actual and not as measured by the marketing department of the local cable company. 😉

That would have to be from a source that can push 1 Gbps to you, and you would have to be capable of handling it. (Upstream limitations being what chokes my local speed down.)

I went looking for an example of what that would mean and came up with: “…[you] can download 23 episodes of 30 Rock in less than two minutes.”

On the whole, I would rather not.
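For the curious, the arithmetic behind that claim, assuming an ideal link with no protocol overhead and decimal units (1 Gbps = 10^9 bits per second):

```python
GBPS = 1_000_000_000   # 1 Gbps in bits per second
seconds = 120          # "less than two minutes"
episodes = 23

total_bytes = GBPS * seconds / 8                     # bytes movable in two minutes
per_episode_mb = total_bytes / episodes / 1_000_000  # ceiling on average episode size

print(f"{total_bytes / 1e9:.0f} GB total, about {per_episode_mb:.0f} MB per episode")
```

That prints "15 GB total, about 652 MB per episode", so the claim holds as long as an episode averages no more than roughly 650 MB.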

What other uses would you suggest for 1Gbps network speeds?

Assuming you have the capacity to push back at the same speed, I wonder what that means in terms of querying/viewing data as a topic map?

Transformation to a topic map for only a subset of data?

Looking forward to seeing your entries!

Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Tuesday, June 5th, 2012

6th International Workshop on Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Paper Submission: June 21, 2012
Authors Notification: July 10, 2012
Final Paper Submission and Registration: July 24, 2012

In conjunction with International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – IC3K 2012 – 04 – 07 October, 2012 – Barcelona, Spain.


Nowadays users are more and more interested in information rather than in mere raw data. The huge amount of accessible data sources is growing rapidly. This calls for novel systems providing effective means of searching and retrieving information with the fundamental goal of making it exploitable by humans and machines.
DART focuses on researching and studying new challenges in distributed information filtering and retrieval. In particular, DART aims to investigate novel systems and tools to distributed scenarios and environments. DART will contribute to discuss and compare suitable novel solutions based on intelligent techniques and applied in real-world applications.
Information Retrieval attempts to address similar filtering and ranking problems for pieces of information such as links, pages, and documents. Information Retrieval systems generally focus on the development of global retrieval techniques, often neglecting individual user needs and preferences.
Information Filtering has drastically changed the way information seekers find what they are searching for. In fact, they effectively prune large information spaces and help users in selecting items that best meet their needs, interests, preferences, and tastes. These systems rely strongly on the use of various machine learning tools and algorithms for learning how to rank items and predict user evaluation.

Topics of Interest

Topics of interest will include (but are not limited to):

  • Web Information Filtering and Retrieval
  • Web Personalization and Recommendation
  • Web Advertising
  • Web Agents
  • Web of Data
  • Semantic Web
  • Linked Data
  • Semantics and Ontology Engineering
  • Search for Social Networks and Social Media
  • Natural Language and Information Retrieval in the Social Web
  • Real-time Search
  • Text categorization

If you are interested and have the time (or graduate students with the time), abstracts from prior conferences are here. Would be a useful exercise to search out publicly available copies. (As far as I can tell, no abstracts from DART.)

How to Stay Current in Bioinformatics/Genomics [Role for Topic Maps as Filters?]

Wednesday, May 30th, 2012

How to Stay Current in Bioinformatics/Genomics by Stephen Turner.

From the post:

A few folks have asked me how I get my news and stay on top of what’s going on in my field, so I thought I’d share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what’s interesting, but filtering out what isn’t. What you don’t read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it’s essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I’m more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here’s how I stay current with things that are relevant to me. Please leave comments with what you’re reading and what you find useful that I omitted here.

Here is a concrete example of the information feeds used to stay current on bioinformatics/genomics.

A topic map mantra has been: “All the information about a subject in one place.”

Should that change to: “Current information about subject(s) ….”? That is, rather than aggregation, topic maps as a filtering strategy?

I think of filters as “subtractive” but that is only one view of filtering.

Can have “additive” filters as well.

Take a look at the information feeds Stephen is using.

Would you use topic maps as “additive” or “subtractive” filters?

Custom security filtering in Solr

Tuesday, April 3rd, 2012

Custom security filtering in Solr by Erik Hatcher

Yonik recently wrote about “Advanced Filter Caching in Solr” where he talked about expensive and custom filters; it was left as an exercise to the reader on the implementation details. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

Recap of Solr’s filtering and caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy scored, query (the infamous q parameter). As users navigate they will browse into facets. The search application generates filter query (fq) parameters for faceted navigation (eg. fq=color:red, as in the article referenced above). The filter queries are not involved in document scoring, serving only to reduce the search space. Solr sports a filter cache, caching the document sets of each unique filter query. These document sets are generated in advance, cached, and reduce the documents considered by the main query. Caching can be turned off on a per-filter basis; when filters are not cached, they are used in parallel to the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter in order to prioritize the leap-frogging (smallest set first would minimize documents being considered for matching).

Post filtering

Even without caching, filter sets default to generate in advance. In some cases it can be extremely expensive and prohibitive to generate a filter set. One example of this is with access control filtering that needs to take the user’s query context into account in order to know which documents are allowed to be returned or not. Ideally only matching documents, documents that match the query and straightforward filters, should be evaluated for security access control. It’s wasteful to evaluate any other documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works.

Good examples, but also heed the author’s warning to use the techniques in this article only when necessary. Sometimes simple solutions are the best. Like using the network authentication layer to prevent unauthorized users from seeing the Solr application at all. No muss, no fuss.
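To picture the post filtering idea from the excerpt, here is a toy sketch in Python. It is not Solr code and not the author's implementation: `DOCS`, `allowed_for`, and `search` are names I made up, and the "expensive" access control check is just a set lookup. The point is only the ordering: the cheap, precomputed filter and the query run first, and the costly check sees only documents that already matched.

```python
from typing import Any, Callable, Dict, List, Set

# Toy "index": document id -> fields. Illustrative data only, not a Solr schema.
DOCS: Dict[int, Dict[str, Any]] = {
    1: {"text": "red apple pie", "color": "red", "acl": {"alice", "bob"}},
    2: {"text": "green apple tart", "color": "green", "acl": {"alice"}},
    3: {"text": "red cherry pie", "color": "red", "acl": {"bob"}},
    4: {"text": "red apple cider", "color": "red", "acl": {"carol"}},
}

acl_checks: List[int] = []  # records which documents the costly check saw


def allowed_for(user: str) -> Callable[[int], bool]:
    """Build the (pretend) expensive access control check for one user."""
    def check(doc_id: int) -> bool:
        acl_checks.append(doc_id)
        return user in DOCS[doc_id]["acl"]
    return check


def search(term: str, cached_filter: Set[int],
           post_filter: Callable[[int], bool]) -> List[int]:
    """Apply the cheap filter and the query first, the costly filter last."""
    hits = []
    for doc_id in sorted(cached_filter):      # precomputed set, e.g. color:red
        if term not in DOCS[doc_id]["text"].split():
            continue                          # not a query match, skip ACL work
        if post_filter(doc_id):               # only query matches reach here
            hits.append(doc_id)
    return hits


red_docs = {d for d, f in DOCS.items() if f["color"] == "red"}
print(search("apple", red_docs, allowed_for("alice")))  # [1]
print(acl_checks)                                       # [1, 4]
```

Of the four documents, only the two that matched both the query and the color filter were ever checked against the access control list.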

Tesseract – Fast Multidimensional Filtering for Coordinated Views

Sunday, March 25th, 2012

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.

Are you ready to “slice and dice” your data set?
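The incremental filtering described in the excerpt can be sketched roughly as follows. This is Python rather than JavaScript, and it is not Tesseract's API: the `Dimension` class and its methods are hypothetical, and the only "reduce" kept up to date is a count. The assumption is a range filter on one dimension, where a sorted index lets a small adjustment touch only the records that enter or leave the range.

```python
from bisect import bisect_left
from typing import List


class Dimension:
    """One filterable dimension backed by a sorted index (hypothetical class)."""

    def __init__(self, values: List[float]) -> None:
        # Record numbers ordered by value, plus the values in that order.
        self.order = sorted(range(len(values)), key=values.__getitem__)
        self.sorted_values = [values[i] for i in self.order]
        self.selected = [True] * len(values)   # per-record filter state
        self.count = len(values)               # running reduce: selected count
        self.lo, self.hi = 0, len(values)      # current window into the index

    def _set(self, start: int, stop: int, state: bool) -> int:
        """Set index positions start..stop-1 to state; return how many."""
        for pos in range(start, stop):
            self.selected[self.order[pos]] = state
            self.count += 1 if state else -1
        return max(0, stop - start)

    def filter_range(self, low: float, high: float) -> int:
        """Keep records with low <= value < high; return records touched."""
        new_lo = bisect_left(self.sorted_values, low)
        new_hi = max(new_lo, bisect_left(self.sorted_values, high))
        touched = 0
        # Positions leaving the window.
        touched += self._set(self.lo, min(self.hi, new_lo), False)
        touched += self._set(max(self.lo, new_hi), self.hi, False)
        # Positions entering the window.
        touched += self._set(new_lo, min(new_hi, self.lo), True)
        touched += self._set(max(new_lo, self.hi), new_hi, True)
        self.lo, self.hi = new_lo, new_hi
        return touched


amounts = [50, 10, 40, 20, 30]
dim = Dimension(amounts)
print(dim.filter_range(20, 45), dim.count)  # 2 3
print(dim.filter_range(20, 55), dim.count)  # 1 4
```

Widening the upper bound from 45 to 55 touches a single record instead of rescanning all five, which is the saving the excerpt is pointing at.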

On the Power of HBase Filters

Thursday, March 8th, 2012

On the Power of HBase Filters

From the post:

Filters are a powerful feature of HBase to delegate the selection of rows to the servers rather than moving rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.

Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, shown to the user, is considered to be the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe and consistent storage and access.

The post goes on to observe:

When you deal with BigData, the data center is your computer.

True, but that isn’t the lesson I would draw from HBase Filters.

The lesson I would draw is: it is only big data until you can find the relevant data.

I may have to sift several haystacks of data but at the end of the day I want the name, photo, location, target, time frame for any particular evil-doer. That “big data” was part of the process is a fact, not a goal. Yes?
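As a rough illustration of why pushing the filter to the server matters, here is a toy simulation in Python. It does not use the HBase client API; `ToyRegionServer`, `scan`, and `is_error` are made-up names, and "shipping" a row is just counting it. The same question is asked twice, once filtering on the client and once handing the filter to the server.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, Dict[str, str]]  # (row key, columns)


class ToyRegionServer:
    """Stand-in for a server holding rows; counts rows sent to the client."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.rows_shipped = 0

    def scan(self, row_filter: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        # The filter runs here, next to the data, before anything is sent back.
        result = [r for r in self.rows if row_filter is None or row_filter(r)]
        self.rows_shipped += len(result)
        return result


def is_error(row: Row) -> bool:
    return row[1]["level"] == "ERROR"


server = ToyRegionServer(
    [(f"row{i}", {"level": "ERROR" if i % 100 == 0 else "INFO"})
     for i in range(1000)]
)

# Client-side selection: every row crosses the wire, then most are discarded.
client_side = [r for r in server.scan() if is_error(r)]
shipped_without_filter = server.rows_shipped

# Server-side selection: only the relevant rows are shipped.
server.rows_shipped = 0
server_side = server.scan(is_error)
shipped_with_filter = server.rows_shipped

print(shipped_without_filter, shipped_with_filter)  # 1000 10
```

Both approaches return the same ten rows. The difference is how much hay had to be moved to find them.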

A Simple News Exploration Interface

Monday, November 14th, 2011

A Simple News Exploration Interface

Matthew Hurst writes:

I’ve just pushed out the next version of the hapax page. I’ve changed the interface to allow for dynamic filtering of the news stories presented. You can now type in filter terms (such as ‘bbc’ or ‘greece’) and the page will only display those stories that are related to those terms.

Very cool!

New Challenges in Distributed Information Filtering and Retrieval

Sunday, September 11th, 2011

New Challenges in Distributed Information Filtering and Retrieval

Proceedings of the 5th International Workshop on New Challenges in Distributed Information Filtering and Retrieval
Palermo, Italy, September 17, 2011.

Edited by:

Cristian Lai – CRS4, Loc. Piscina Manna, Building 1 – 09010 Pula (CA), Italy

Giovanni Semeraro – Dept. of Computer Science, University of Bari, Aldo Moro, Via E. Orabona, 4, 70125 Bari, Italy

Eloisa Vargiu – Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

Table of Contents:

  1. Experimenting Text Summarization on Multimodal Aggregation
    Giuliano Armano, Alessandro Giuliani, Alberto Messina, Maurizio Montagnuolo, Eloisa Vargiu
  2. From Tags to Emotions: Ontology-driven Sentimental Analysis in the Social Semantic Web
    Matteo Baldoni, Cristina Baroglio, Viviana Patti, Paolo Rena
  3. A Multi-Agent Decision Support System for Dynamic Supply Chain Organization
    Luca Greco, Liliana Lo Presti, Agnese Augello, Giuseppe Lo Re, Marco La Cascia, Salvatore Gaglio
  4. A Formalism for Temporal Annotation and Reasoning of Complex Events in Natural Language
    Francesco Mele, Antonio Sorgente
  5. Interaction Mining: the new Frontier of Call Center Analytics
    Vincenzo Pallotta, Rodolfo Delmonte, Lammert Vrieling, David Walker
  6. Context-Aware Recommender Systems: A Comparison Of Three Approaches
    Umberto Panniello, Michele Gorgoglione
  7. A Multi-Agent System for Information Semantic Sharing
    Agostino Poggi, Michele Tomaiuolo
  8. Temporal characterization of the requests to Wikipedia
    Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Rocio Muñoz-Mansilla, Israel Herraiz
  9. From Logical Forms to SPARQL Query with GETARUN
    Rocco Tripodi, Rodolfo Delmonte
  10. ImageHunter: a Novel Tool for Relevance Feedback in Content Based Image Retrieval
    Roberto Tronci, Gabriele Murgia, Maurizio Pili, Luca Piras, Giorgio Giacinto

OrganiK Knowledge Management System

Monday, July 4th, 2011

OrganiK Knowledge Management System (wiki)

OrganiK Knowledge Management System (homepage)

I encountered the OrganiK project while searching for something else (naturally). 😉

From the homepage:

Objectives of the Project

The aim of the OrganiK project is to research and develop an innovative knowledge management system that enables the semantic fusion of enterprise social software applications. The system accumulates information that can be exchanged among one or several collaborating companies. This enables an effective management of organisational knowledge and can be adapted to functional requirements of smaller and knowledge-intensive companies.

More info..

Main distinguishing features

The set of OrganiK KM Client Interfaces comprises of a Wiki, a Blog, a Social Bookmarking and a Search Component that together constitute a Collaborative Workspace for SME knowledge workers. Each of the components consists of a Web-based client interface and a corresponding server engine.
The components that comprise the Business Logic Layer of the OrganiK KM Server are:

  • the Recommender System,
  • the Semantic Text Analyser,
  • the Collaborative Filtering Engine
  • the Full-text Indexer

More info…

Interesting project but the latest news item dates from 2008. Not encouraging.

I checked the source code and the most recent update was August, 2010. Much more encouraging.

Have written for more recent news.


Sunday, July 3rd, 2011


From the Get Started page:

The mission of the SwiftRiver initiative is to democratize access to the tools used to make sense of data.

To achieve this goal we’ve taken two approaches, apps and APIs. Apps are user facing and should be tools that are easy to understand, deploy and use. APIs are machine facing and extract meta-context that other machines (apps) use to convey information to the end user.

SwiftRiver is an opensource platform that aims to allow users to do three things well: 1) structure unstructured data feeds, 2) filter and prioritize information conditionally and 3) add context to content. Doing these things well allows users to pull in real-time content from Twitter, SMS, Email or the Web and to make sense of data on the fly.

The Ushahidi logo at the top will take you to a common wiki for Ushahidi and SwiftRiver.

And the Ushahidi link in text takes you to: Ushahidi:

We are a non-profit tech company that develops free and open source software for information collection, visualization and interactive mapping.

Home of:

  • Ushahidi Platform: We built the Ushahidi platform as a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.
  • SwiftRiver: SwiftRiver is an open source platform that aims to democratize access to tools for filtering & making sense of real-time information.
  • Crowdmap: When you need to get the Ushahidi platform up in 2 minutes to crowdsource information, Crowdmap will do it for you. It’s our hosted version of the Ushahidi platform.

It occurs to me that mapping email feeds would fit right into my example in Marketing What Users Want…And An Example.

    Big Data Could Be Big Pain Without Semantic Search To Help Filter It

    Thursday, June 23rd, 2011

    Big Data Could Be Big Pain Without Semantic Search To Help Filter It

    From the post:

    Search Explore Engine leverages the core of its Cogito Focus technology to provide multiple ways to filter data with the help of semantic tagging and categorization. But it also includes a new interface that Scagliarini says makes it more accessible to less advanced users for intuitive, visual navigation of tags and facets, as well as interaction with search results to discover new connections and data.

    One feature, the treemap graphic, summarizes information included in a search data stream by representing each topic in a different color, using the size of squares to indicate the frequency of similar documents, and using shades of color to distinguish recent news from older events.

    “A big chunk of the innovation in Search Explore Engine is really to make it simple to integrate information,” he says. As an example, it provides an out-of-the-box geographic taxonomy for identifying specific geographic areas referenced in the dynamic information stream or licensed data streams, and enabling users to create ways to access that information using integration with maps. “So they can create an area of interest [on a map] and retrieve information mainly about that area. Or there’s the possibility to give a visualization of entity maps – all the entities included in a set of documents that you select have a visual representation that shows which kinds of entities are related to which kind of other entities, so you can use the map to filter down and identify your search criteria or your search intent,” he says.

    The solution is initially targeted at advanced knowledge workers but Scagliarini says that the user base will expand pretty quickly. “This level of sophistication is done by the business analyst or the marketing managers or those dealing with extracting knowledge,” who will prepackage and distribute the information inside the organization, he says, “but we think progressively this need is broader in the organization. If you don’t have any kind of ways to filter more effectively all the information you have access to, you are already at a disadvantage and that can get only worse.”

    I am torn between the two lines:

    there’s the possibility to give a visualization of entity maps – all the entities included in a set of documents that you select have a visual representation that shows which kinds of entities are related to which kind of other entities, so you can use the map to filter down and identify your search criteria or your search intent (emphasis added)


    If you don’t have any kind of ways to filter more effectively all the information you have access to, you are already at a disadvantage and that can get only worse. (emphasis added)

    as to which one I like better.

    The one on “entity maps” is talking about topic maps without using the term and the one about filtering captures one aspect of the modern information dilemma that topic maps can solve.

    Which one do you like better?

    The Science and Magic of User and Expert Feedback for Improving Recommendations

    Friday, May 27th, 2011

    The Science and Magic of User and Expert Feedback for Improving Recommendations by Dr. Xavier Amatriain (Telefonica).


    Recommender systems are playing a key role in the next web revolution as a practical alternative to traditional search for information access and filtering. Most of these systems use Collaborative Filtering techniques in which predictions are solely based on the feedback of the user and similar peers. Although this approach is considered relatively effective, it has reached some practical limitations such as the so-called Magic Barrier. Many of these limitations stem from the fact that explicit user feedback in the form of ratings is considered the ground truth. However, this feedback has a non-negligible amount of noise and inconsistencies. Furthermore, in most practical applications, we lack enough explicit feedback and would be better off using implicit feedback or usage data.

    In the first part of my talk, I will present our studies in analyzing natural noise in explicit feedback and finding ways to overcome it to improve recommendation accuracy. I will also present our study of user implicit feedback and an approach to relate both kinds of information. In the second part, I will introduce a radically different approach to recommendation that is based on the use of the opinions of experts instead of regular peers. I will show how this approach addresses many of the shortcomings of traditional Collaborative Filtering, generates recommendations that are better perceived by the users, and allows for new applications such as fully-privacy preserving recommendations.
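The peer-based and expert-based approaches in the abstract share the same basic arithmetic, which can be sketched in a few lines of Python. This is a minimal illustration, not Dr. Amatriain's method: the similarity measure (one over one plus the mean absolute rating difference) is my own simplification, and `similarity`, `predict`, and all of the ratings are made up. The only difference between the two calls at the end is whose ratings serve as neighbours.

```python
from typing import Dict, Optional

Ratings = Dict[str, float]  # item -> rating


def similarity(a: Ratings, b: Ratings) -> float:
    """Toy similarity: 1 / (1 + mean absolute difference on shared items)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_diff = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_diff)


def predict(user: Ratings, neighbours: Dict[str, Ratings],
            item: str) -> Optional[float]:
    """Similarity-weighted average of the neighbours' ratings for item."""
    weighted, total = 0.0, 0.0
    for ratings in neighbours.values():
        if item in ratings:
            w = similarity(user, ratings)
            weighted += w * ratings[item]
            total += w
    return weighted / total if total > 0 else None


user = {"A": 5, "B": 1}
peers = {
    "p1": {"A": 5, "B": 1, "C": 4},   # agrees with the user
    "p2": {"A": 1, "B": 5, "C": 2},   # disagrees with the user
}
experts = {"critic": {"A": 4, "B": 2, "C": 5}}

print(round(predict(user, peers, "C"), 2))  # 3.67
print(predict(user, experts, "C"))          # 5.0
```

Swap "critic" for "reference librarian" and the second call is the case I make below.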

    Chris Anderson: “We are leaving the age of information and entering the age of recommendation.”

    I suspect Chris Anderson must not be an active library user. Long before recommender systems, librarians have been making recommendations to researchers, patrons and children doing homework. I would say we are returning to the age of librarians, assisted by recommender systems.

    Librarians use the reference interview so that based on feedback from patrons they can make the appropriate recommendations.

    If you substitute librarian for “expert” in this presentation, it becomes apparent the world of information is coming back around to libraries and librarians.

    Librarians should be making the case, both in the literature and to researchers like Dr. Amatriain, that librarians can play a vital role in recommender systems.

    This is a very enjoyable as well as useful presentation.

    For further information see:

    The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity

    Monday, May 16th, 2011

    The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity by Maria Popova.

    Covers the same TED presentation that I mention at On the dangers of personalization but with the value-add that Maria both interviews Eli Pariser and talks about his new book, The Filter Bubble.

    I remain untroubled by filtering.

    We filter the information we give others around us.

    Advertisers filter the information they present in commercials.

    For example, I don’t recall any Toyota ads that end with: Buy a Toyota ****, your odds of being in a recall are 1 in ***. That’s filtering.

    Two things would increase my appreciation for Google filtering:

    First, much better filtering, where I can choose narrow-band filter(s) based on my interests.

    Second, the ability to turn the filters off at my option.

    You see, I don’t agree that there is information I need to know as determined by someone else.

    Here’s an interesting question: What information would you filter from:

    On the dangers of personalization

    Saturday, May 7th, 2011

    On the dangers of personalization

    From the post:

    We’re getting our search results seriously edited and, I bet, most of us don’t even know it. I didn’t. One Google engineer says that their search engine uses 57 signals to personalize your search results, even when you’re logged out.

    Do we really want to live in a web bubble?

    What I find interesting about this piece is that it describes a data silo but from the perspective of an individual.

    Think about it.

    A data silo is based on data that is filtered and stored.

    Personalization is based on data that is filtered and presented.

    Do you see any difference?