Archive for the ‘Searching’ Category

Can You Replicate Your Searches?

Thursday, February 16th, 2017

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Melissa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed” [quotes not in original]. The correct syntax would be “tumour bed,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse effects.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.

We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Tuesday, February 14th, 2017

We’re Bringing Learning to Rank to Elasticsearch.

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

  1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
  2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  3. Train a model that can accurately map features to a relevance score
  4. Deploy the model to your search infrastructure, using it to rank search results in production

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)
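The four steps quoted above fit in a few lines of Python. A minimal pointwise sketch, where the feature values, grades, and document names are all invented and a plain linear model stands in for a real learning-to-rank model:

```python
# Pointwise learning-to-rank in miniature: fit weights that map
# query/document features to relevance grades, then rank with the model.
# All feature values and grades below are invented for illustration.

# 1. Judgment list: (features, grade); 2 = exactly, 1 = moderately, 0 = not relevant
judgments = [
    ((0.9, 0.8), 2.0),  # strong field match, recent
    ((0.7, 0.2), 1.0),  # good field match, stale
    ((0.1, 0.9), 0.0),  # weak field match, recent
    ((0.2, 0.1), 0.0),  # weak field match, stale
]

# 2-3. Hypothesized features: a TF*IDF-style match score and recency.
# Train a linear model with plain stochastic gradient descent.
w = [0.0, 0.0]
for _ in range(2000):
    for (f1, f2), grade in judgments:
        err = w[0] * f1 + w[1] * f2 - grade
        w[0] -= 0.05 * err * f1
        w[1] -= 0.05 * err * f2

# 4. "Deploy": score and rank candidate documents for a query.
def score(features):
    return w[0] * features[0] + w[1] * features[1]

candidates = {"doc_a": (0.8, 0.7), "doc_b": (0.3, 0.9), "doc_c": (0.6, 0.1)}
ranking = sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)
print(ranking)
```

Each of the four steps is a one-screen toy here; as the post says, in production each hides hard problems (analytics pipelines, feature engineering, model choice, serving infrastructure).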

A great post as always but of particular interest for topic map fans is this passage:

Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

Google Helps Spread Fake News [Fake News & Ad Revenue – Testing]

Saturday, December 10th, 2016

Google changed its search algorithm and that made it more vulnerable to the spread of fake news by Hannah Roberts.

From the post:

Google’s search algorithm has been changed over the last year to increasingly reward search results based on how likely you are to click on them, multiple sources tell Business Insider.

As a result, fake news now often outranks accurate reports on higher quality websites.

The problem is so acute that Google’s autocomplete suggestions now actually predict that you are searching for fake news even when you might not be, as Business Insider noted on December 5.

Hannah does a great job of setting forth the evidence and opinions on the algorithm change but best summarizes it when she says:

The changes to the algorithm now move links up Google’s search results page if Google detects that more people are clicking on them, search experts tell Business Insider.

Just in case you don’t know:

more clicks != credible/useful search results

But it is true:

more clicks = more usage/ad revenue
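The inequality above in runnable form, with invented click-through rates and relevance grades: ranking by clicks and ranking by editorial relevance pick different winners.

```python
# "More clicks != credible results" in miniature: rank the same documents
# by click-through rate and by relevance grade (all numbers invented).
docs = {
    "clickbait": {"ctr": 0.41, "grade": 0},        # clicked often, not credible
    "accurate_report": {"ctr": 0.12, "grade": 2},  # credible, clicked less
    "background_piece": {"ctr": 0.09, "grade": 1},
}

by_ctr = sorted(docs, key=lambda d: docs[d]["ctr"], reverse=True)
by_grade = sorted(docs, key=lambda d: docs[d]["grade"], reverse=True)

print(by_ctr[0], by_grade[0])
```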

Google and Facebook find “fake news” profitable. Both will make a great show of suppressing outlying “fake news,” but not so much as to impact profits.

There’s a data science “fake news” project:

Track the suppression of “fake news” by Google and Facebook against the performance of their ad revenue.

Hypothesis: When suppression of “fake news” impinges on ad revenue for more than two consecutive hours, dial back on suppression mechanisms. (ditto for 4, 6, 12 and 24 hour cycles)
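A sketch of the proposed measurement: correlate hourly suppression levels against hourly ad revenue. Both series below are invented; real data would require a transparency the platforms are unlikely to provide.

```python
# Correlate "fake news" suppression against ad revenue, hour by hour.
# Both series are invented for illustration.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

suppression = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.1]  # share of flagged items hidden
ad_revenue = [105, 103, 88, 84, 90, 102, 104, 106]      # $k per hour

r = pearson(suppression, ad_revenue)
print(round(r, 2))
```

A strongly negative `r` on real data would be consistent with the hypothesis: suppression costs revenue, giving an incentive to dial it back.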

Odds on Google and Facebook being transparent enough about suppression of “fake news” and ad revenue to make the results of testing that hypothesis verifiable?


Egyptological Museum Search

Tuesday, November 22nd, 2016

Egyptological Museum Search

From the post:

The Egyptological museum search is a PHP tool aimed to facilitate locating the descriptions and images of ancient Egyptian objects in online catalogues of major museums. Online catalogues (ranging from selections of highlights to complete digital inventories) are now offered by almost all major museums holding ancient Egyptian items and have become indispensable in research work. Yet the variety of web interfaces and of search rules may overstrain any person performing many searches in different online catalogues.

Egyptological museum search was made to provide a single search point for finding objects by their inventory numbers in major collections of Egyptian antiquities that have online catalogues. It tries to convert user input into search queries recognised by museums’ websites. (Thus, for example, stela Geneva D 50 is searched as “D 0050,” statue Vienna ÄS 5046 is searched as “AE_INV_5046,” and coffin Turin Suppl. 5217 is searched as “S. 05217.”) The following online catalogues are supported:

The search interface uses a short list of aliases for museums.
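The conversions quoted above amount to per-museum rewrite rules. A hypothetical sketch of that idea (the actual tool is PHP; this rule table is only a guess from the three examples in the post):

```python
# Per-museum rewrite rules for inventory numbers, reconstructed from the
# three examples in the post. Hypothetical illustration, not the tool's code.
def to_query(museum, number):
    """Convert a conventional inventory number to a catalogue query string."""
    if museum == "Geneva":            # stela Geneva D 50 -> "D 0050"
        prefix, num = number.split()
        return f"{prefix} {int(num):04d}"
    if museum == "Vienna":            # statue Vienna ÄS 5046 -> "AE_INV_5046"
        num = number.split()[-1]
        return f"AE_INV_{num}"
    if museum == "Turin":             # coffin Turin Suppl. 5217 -> "S. 05217"
        num = number.split()[-1]
        return f"S. {int(num):05d}"
    return number                     # fall back to the raw number

print(to_query("Geneva", "D 50"))
```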

Once you see/use the interface proper, here, I hope you are interested in volunteering to improve it.

Guide to Making Search Relevance Investments, free ebook

Thursday, October 20th, 2016

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is an interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.


The Podesta Emails [In Bulk]

Wednesday, October 19th, 2016

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:


and, advanced:


As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!
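With the bulk files in hand, the first bullet above, analyzing sender/recipient relationships, is straightforward scripting. A minimal standard-library sketch of counting sender-to-recipient edges (the sample message is invented, not an actual Podesta email):

```python
# Count sender -> recipient edges across a corpus of raw email files.
# The sample message below is invented for illustration.
import email
from collections import Counter
from email.utils import getaddresses

raw = b"""From: alice@example.org
To: bob@example.org, carol@example.org
Subject: schedule

See attached."""

def edges(raw_bytes):
    """Return (sender, recipient) pairs from one raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes)
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    recipients = [addr for _, addr in getaddresses(
        msg.get_all("To", []) + msg.get_all("Cc", []))]
    return [(s, r) for s in senders for r in recipients]

graph = Counter()
for message in [raw]:  # in practice: iterate over the unzipped .eml files
    graph.update(edges(message))

print(graph.most_common(2))
```

The resulting edge counts feed directly into graph analysis or visualization tools, none of which a web search box can offer.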


Apache Lucene 6.2.1 and Apache Solr 6.2.1 Available [Presidential Data Leaks]

Thursday, September 22nd, 2016

Lucene can be downloaded from

Solr can be downloaded from

If you aren’t using Lucene/Solr 6.2, here’s your chance to grab the latest bug fixes as well!

Data leaks will accelerate as the US presidential election draws to a close.

What’s your favorite tool for analysis and delivery of data dumps?


NSA: Being Found Beats Searching, Every Time

Tuesday, September 20th, 2016

Equation Group Firewall Operations Catalogue by Mustafa Al-Bassam.

From the post:

This week someone auctioning hacking tools obtained from the NSA-based hacking group “Equation Group” released a dump of around 250 megabytes of “free” files for proof alongside the auction.

The dump contains a set of exploits, implants and tools for hacking firewalls (“Firewall Operations”). This post aims to be a comprehensive list of all the tools contained or referenced in the dump.

Mustafa’s post is a great illustration of why “being found beats searching, every time.”

Think of the cycles you would have to spend to duplicate this list. Multiply that by the number of people interested in this list. Assuming their time is not valueless, do you start to see the value-add of Mustafa’s post?

Mustafa found each of these items in the data dump and then preserved his finding for the use of others.

It’s not a very big step beyond this preservation to the creation of a container for each of these items, enabling the preservation of other material found on them or related to them.

Search is a starting place and not a destination.

Unless you enjoy repeating the same finding process over and over again.

Your call.

Corrects Clinton-Impeachment Search Results

Monday, September 19th, 2016

After posting Search Alert: “…previous total of 261 to the new total of 0.” [Solved] yesterday, pointing out that a change from http:// to https:// altered a search result for Clinton w/in 5 words impeachment, I got an email this morning:


I appreciate the update and correction for saved searches, but my point about remote data changing without notice to you remains valid.

I’m still waiting for word on bulk downloads from both Wikileaks and DC Leaks.

Why leak information vital to public discussion and then limit access to search?

Introducing arxiv-sanity

Sunday, September 18th, 2016

Only a small part of Arxiv appears at: but it is enough to show the feasibility of this approach.

What captures my interest is the potential to substitute/extend the program to use other similarity measures.
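For instance, one drop-in similarity measure is TF-IDF cosine over abstracts, sketched here without dependencies (the abstracts are invented; this is an illustration of the measure, not arxiv-sanity's code):

```python
# TF-IDF cosine similarity between paper abstracts, dependency-free.
# Abstracts are invented for illustration.
import math
from collections import Counter

abstracts = {
    "p1": "monte carlo tree search for games",
    "p2": "deep learning for game playing",
    "p3": "bayesian inference with markov chains",
}

def tfidf_vectors(docs):
    """Map each doc to {word: tf * idf} over the whole collection."""
    n = len(docs)
    df = Counter(w for text in docs.values() for w in set(text.split()))
    return {
        doc: {w: tf * math.log(n / df[w])
              for w, tf in Counter(text.split()).items()}
        for doc, text in docs.items()
    }

def cosine(u, v):
    def norm(x):
        return math.sqrt(sum(t * t for t in x.values())) or 1.0
    dot = sum(wt * v.get(word, 0.0) for word, wt in u.items())
    return dot / (norm(u) * norm(v))

vecs = tfidf_vectors(abstracts)
print(cosine(vecs["p1"], vecs["p2"]), cosine(vecs["p1"], vecs["p3"]))
```

Swapping in another measure (citation overlap, embeddings, etc.) only requires replacing the vector construction and the comparison function.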

Bearing in mind that searching is only the first step towards the acquisition and preservation of knowledge.

PS: I first saw this in a tweet by Data Science Renee.

Search Alert: “…previous total of 261 to the new total of 0.” [Solved]

Sunday, September 18th, 2016

Odd message from the search alert this AM:


Here’s the search I created back in June, 2016:


My (probably inaccurate) recollection is that I was searching for a quote from the impeachment of Bill Clinton and was too lazy to specify a term of congress, hence:

all congresses – searching for Clinton within five words, impeachment

Fairly trivial search that produced 261 “hits.”

I set the search alert more to explore the search options than any expectation of different future results.

Imagine my surprise to find that all congresses – searching for Clinton within five words, impeachment performed today, results in 0 “hits.”

Suspecting some internal changes to the search interface, I re-entered the search today and got 0 “hits.”

Other saved searches with radically different search results as of today?

This is not, repeat not, the result of some elaborate conspiracy to assist Secretary Clinton in her bid for the presidency.

I do think something fundamental has gone wrong with searching at and it needs to be fixed.

This is an illustration of why Wikileaks, DC Leaks and other data sites should provide easy to access downloads in bulk of their materials.

Providing search interfaces to document collections is a public service, but document collections or access to them can change in ways not transparent to search users. Such as demonstrated by the CIA removing documents previously delivered to the Senate.

Petition Wikileaks, DC Leaks and other data sites for easy bulk downloads.

That will ensure the “evidence” will not shift under your feet and the availability of more sophisticated means of analysis than brute-force search.

Update: The site’s change from http:// to https:// trashed both my saved query and my attempt to re-perform the same search using http://.

Using https:// returns the same 261 search results.

What is your experience with other saved searches there?

Five Essential Research Tips for Journalists Using Google

Saturday, July 2nd, 2016

Five Essential Research Tips for Journalists Using Google by Temi Adeoye.

This graphic:


does not appear in Temi’s post but rather in a tweet by the International Center For Journalism (ICFJ) about his post.

See Temi’s post for the details but this graphic is a great reminder.

This will make a nice addition to my local page of search links.

Visual Searching with Google – One Example – Neo4j – Raspberry Pi

Tuesday, April 26th, 2016

Just to show I don’t spend too much time thinking of ways to gnaw on the ankles of Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, here is my experience with visual searching with Google today.

I saw this image on Twitter:


I assumed that cutting the “clutter” from around the cluster might produce a better result. Besides, the plastic separators looked (to me) to be standard and not custom made.

Here is my cropped image for searching:


Google responded this looks like: “water.” 😉

OK, so I tried cropping it more just to show the ports, thinking that might turn up similar port arrangements, here’s that image:


Google says: “machinery.” With a number of amusing “similar” images.

BTW, when I tried the full image, the first one, Google says: “electronics.”

OK, so much for Google image searching. What if I try?

Searching on neo4j cluster and raspberry pi (the most likely suspect), my first “hit” had this image:


Same height as the search image.

My seventh “hit” has this image:


Same height and logo as the search image. That’s Stefan Armbruster next to the cluster. (He does presentations on building the cluster, but I have yet to find a video of one of those presentations.)

My eighth “hit”:


Common wiring color (networking cable), height.

Definitely Raspberry Pi but I wasn’t able to uncover further details.

Very interested in seeing a video of Stefan putting one of these together!

UC Davis Spent $175,000.00 To Suppress This Image (let’s disappoint them)

Saturday, April 16th, 2016


If you have a few minutes, could you repost this image to your blog and/or Facebook page?

Some references you may want to cite include:

Pepper-sprayed students outraged as UC Davis tried to scrub incident from web by Anita Chabria

Calls for UC Davis chancellor’s ouster grow amid Internet scrubbing controversy by Sarah Parvini and Ruben Vives.

UC Davis Chancellor Faces Calls To Resign Over Pepper Spray Incident (NPR)

Katehi’s effort to alter search engine results backfires spectacularly

UC Davis’ damage control: Dumb-de-dumb-dumb

Reposting the image and links to the posts cited above will help disappoint the mislaid plans to suppress it.

What is more amazing than the chancellor thinking information on the Internet can be suppressed, at least for a paltry $175K, is that this pattern will be repeated year after year.

Lying about information known to others is a losing strategy, always.

But that strategy will be picked up by other universities, governments and their agencies, corporations, to say nothing of individuals.

Had UC Davis spent that $175K on better training for its police officers, people would still talk about this event but it would be in contrast to the new and improved way UC Davis deals with protesters.

That’s not likely to happen now.

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:


If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.
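The same loss figures can be computed (and sized for such a visualization) directly from the two result sets. The article IDs here are invented to keep the example small:

```python
# Size the regions of the two-term Venn diagram from two result-ID sets.
# Article IDs are invented for illustration.
dup_detection = {"a1", "a2", "a3", "a4", "a5"}  # hits for "duplicate detection"
coref_resolution = {"a4", "a5", "a6", "a7"}     # hits for "coreference resolution"

both = dup_detection & coref_resolution
missed_by_dup_query = coref_resolution - dup_detection    # lost if you only search term 1
missed_by_coref_query = dup_detection - coref_resolution  # lost if you only search term 2

print(len(both), len(missed_by_dup_query), len(missed_by_coref_query))
```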

Suggestions for improving this visualization?

It is a visualization that could be performed on client’s data, using their search engine/database.

In order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlap in terminology between vendors for the United States Army and those for the Air Force? That is, no query term for one returns useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capacity, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?

Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Friday, April 8th, 2016

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene can be downloaded from
and Solr can be downloaded from

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored=”false” to preserve sort order on multi-valued fields that have both stored=”true” and docValues=”true”.
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-‘. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)

Happy searching!

Serious Non-Transparency (+ work around)

Tuesday, March 29th, 2016

Yesterday, in Courses -> Texts: A Hidden Relationship, I lamented the inability to find courses by their titles, so that you could easily discover the required/suggested texts for any given course. Like browsing a physical campus bookstore.

Obscurity is an “information smell” (to build upon Felienne‘s expansion of code smell to spreadsheets).

In this particular case, the “information smell” is skunk class.

I revisited today to extract its > 1200 bookstores for use in crawling a sample of those sites.

For ugly HTML, view the source of:

Parsing that is going to take time and surely there is an easy way to get a sample of the sites for mining.
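If you do tackle the HTML, the standard library is enough for a first pass at link extraction. In this sketch `bookstore.example.com` is a stand-in pattern, since the real address form isn't reproduced here:

```python
# Pull matching links out of ugly HTML with only the standard library.
# "bookstore.example.com" is a hypothetical stand-in for the real pattern.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs from <a> tags whose URL contains a given pattern."""
    def __init__(self, pattern):
        super().__init__()
        self.pattern, self.links = pattern, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if self.pattern in href:
                self.links.append(href)

page = ('<ul><li><a href="http://bookstore.example.com/law">Law</a></li>'
        '<li><a href="/about">About</a></li></ul>')
collector = LinkCollector("bookstore.example.com")
collector.feed(page)
print(collector.links)
```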

The idea didn’t occur to me immediately but I noticed yesterday that the general form of web addresses was:

So, after some flailing about with the HTML from, I searched for “” and requested all the results.

I’m picking a random ten bookstores with law books for further searching.

Not a high priority but I am curious what lies behind the smoke, mirrors, complex HTML and poor interfaces.

Maybe something, maybe nothing. Won’t know unless we look.

PS: Perhaps a better query string: textbooks-and-course-materials

Suggested refinements?

Bias For Sale: How Much and What Direction Do You Want?

Tuesday, March 29th, 2016

Epstein and Robertson pitch it a little differently but that is the bottom line of: The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections.


Internet search rankings have a significant impact on consumer choices, mainly because users trust and choose higher-ranked results more than lower-ranked results. Given the apparent power of search rankings, we asked whether they could be manipulated to alter the preferences of undecided voters in democratic elections. Here we report the results of five relevant double-blind, randomized controlled experiments, using a total of 4,556 undecided voters representing diverse demographic characteristics of the voting populations of the United States and India. The fifth experiment is especially notable in that it was conducted with eligible voters throughout India in the midst of India’s 2014 Lok Sabha elections just before the final votes were cast. The results of these experiments demonstrate that (i) biased search rankings can shift the voting preferences of undecided voters by 20% or more, (ii) the shift can be much higher in some demographic groups, and (iii) search ranking bias can be masked so that people show no awareness of the manipulation. We call this type of influence, which might be applicable to a variety of attitudes and beliefs, the search engine manipulation effect. Given that many elections are won by small margins, our results suggest that a search engine company has the power to influence the results of a substantial number of elections with impunity. The impact of such manipulations would be especially large in countries dominated by a single search engine company.

I’m not surprised by SEME (search engine manipulation effect).

Although I would probably be more neutral and say: Search Engine Impact on Voting.

Whether you consider one result or another as the result of “manipulation” is a matter of perspective. No search engine strives to deliver “false” information to users.

Gary Anthes in Search Engine Agendas, Communications of the ACM, Vol. 59 No. 4, pages 19-21, writes:

In the novel 1984, George Orwell imagines a society in which powerful but hidden forces subtly shape peoples’ perceptions of the truth. By changing words, the emphases put on them, and their presentation, the state is able to alter citizens’ beliefs and behaviors in ways of which they are unaware.

Now imagine today’s Internet search engines did just that kind of thing—that subtle biases in search engine results, introduced deliberately or accidentally, could tip elections unfairly toward one candidate or another, all without the knowledge of voters.

That may seem an unlikely scenario, but recent research suggests it is quite possible. Robert Epstein and Ronald E. Robertson, researchers at the American Institute for Behavioral Research and Technology, conducted experiments that showed the sequence of results from politically oriented search queries can affect how users vote, especially among undecided voters, and biased rankings of search results usually go undetected by users. The outcomes of close elections could result from the deliberate tweaking of search algorithms by search engine companies, and such manipulation would be extremely difficult to detect, the experiments suggest.

Gary’s post is a good supplement to the original article, covering some of the volunteers who are ready to defend the rest of us from biased search results.

Or as I would put it, to inject their biases into search results as opposed to other biases they perceive as being present.

If you are more comfortable describing the search results you want presented as “fair and equitable,” etc., please do so but I prefer the honesty of naming biases as such.

Or as David Bowie once said:

Make your desired bias, direction, etc., a requirement and allow data scientists to get about the business of conveying it.

Certainly what “ethical” data scientists are doing at Google as they conspire with the US government and others to overthrow governments, play censor to fight “terrorists,” and undertake other questionable activities.

I object to some of Google’s current biases because I would have them be biased in a different direction.

Let’s sell your bias/perspective to users with a close eye on the bright line of the law.


Courses -> Texts: A Hidden Relationship

Monday, March 28th, 2016

Quite by accident I discovered the relationship between courses and their texts is hidden in many (approx. 2000) campus bookstore interfaces.

If you visit a physical campus bookstore you can browse courses for their textbooks. Very useful if you are interested in the subject but not taking the course.

An online LLM (master’s of taxation) flyer prompted me to check the textbooks for the course work.

A simple enough information request. Find the campus bookstore and browse by course for text listings.

Not so fast!

The online presences of over 1200 campus bookstores are delivered, which offers this interface:


Another 748 campus bookstores are delivered by, with a similar interface for textbooks:


I started this post by saying the relationship between courses and their texts is hidden, but that’s not quite right.

The relationship between a meaningless course number and its required/suggested text is visible, but the identification of a course by a numeric string is hardly meaningful to the casual observer. (Read: not an enrolled student.)

Perhaps better to say that a meaningful identification of courses for non-enrolled students and their relationship to required/suggested texts is absent.

That is the relationship of course -> text is present, but not in a form meaningful to anyone other than a student in that course.

Considering that two separate vendors across almost 2,000 bookstores deliberately obscure the course -> text relationship, one has to wonder why.

I don’t have any immediate suggestions but when I encounter systematic obscuring of information across vendors, alarm bells start to go off.

Just for completeness sake, you can get around the obscuring of the course -> text relationship by searching for syllabus LLM taxation income OR estate OR corporate or (school name) syllabus LLM taxation income OR estate OR corporate. Extract required/suggested texts from posted syllabi.

PS: If you can offer advice on bookstore interfaces, suggest enabling browsing of courses by name, with links to their required/suggested texts.

During the searches I made writing this post, I encountered a syllabus on basic tax by Prof. Bret Wells which has this quote by Martin D. Ginsburg:

Basic tax, as everyone knows, is the only genuinely funny subject in law school.

Tax law does have an Alice in Wonderland quality about it, but The Hunting of the Snark: an Agony in Eight Fits is probably the closer match.

#AlphaGo Style Monte Carlo Tree Search In Python

Sunday, March 27th, 2016

Raymond Hettinger (@raymondh) tweeted the following links for anyone who wants an #AlphaGo style Monte Carlo Tree Search in Python:

Introduction to Monte Carlo Tree Search by Jeff Bradberry.

Monte Carlo Tree Search by Cameron Browne.

Jeff’s post is your guide to Monte Carlo Tree Search in Python while Cameron’s site bills itself as:

This site is intended to provide a comprehensive reference point for online MCTS material, to aid researchers in the field.

I didn’t see any material dated later than 2010 on Cameron’s site.
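To get a feel for what Jeff's post builds toward, here is a minimal UCT (Upper Confidence bounds applied to Trees) sketch in Python. The game (single-pile Nim), the node layout, and the exploration constant are my illustrative choices, not anything from AlphaGo or either reference:

```python
import math
import random

# Toy game for the demo: single-pile Nim. Players alternate taking
# 1-3 stones; whoever takes the last stone wins.
def legal(stones):
    return [m for m in (1, 2, 3) if m <= stones]

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player
        self.parent, self.move = parent, move
        self.children = []
        self.untried = legal(stones)
        self.visits = 0
        self.value = 0.0  # wins counted from the parent's (the mover's) perspective

    def ucb1(self, c=1.4):
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def rollout(stones, player):
    # Random playout; returns the winner (whoever takes the last stone).
    while True:
        stones -= random.choice(legal(stones))
        if stones == 0:
            return player
        player = -player

def mcts(stones, player=1, iterations=2000):
    root = Node(stones, player)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one untried child, if any remain.
        if node.untried:
            m = node.untried.pop()
            child = Node(node.stones - m, -node.player, parent=node, move=m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node (or score a terminal).
        if node.stones == 0:
            winner = -node.player  # the previous player took the last stone
        else:
            winner = rollout(node.stones, node.player)
        # 4. Backpropagation: credit each node from its parent's point of view.
        while node.parent is not None:
            node.visits += 1
            node.value += 1.0 if winner == node.parent.player else 0.0
            node = node.parent
        root.visits += 1
    # Play the most-visited move, the usual "robust child" choice.
    return max(root.children, key=lambda n: n.visits).move
```

From a pile of 5 the winning move is to take 1 (leaving a multiple of 4), and the search converges on it quickly; the same four phases scale to real games once `legal`, the terminal test, and the rollout policy are swapped out.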

Suggestions for other collections of MCTS material that are more up to date?

Searching with Google?

Tuesday, March 8th, 2016

Not critical but worth mentioning.

I saw:

(March 08, 2016) | 28,882 Hillary Clinton Emails

today and wanted to make sure it wasn’t a duplicate upload.

What better to do than invoke Google Advanced Search with:

all these words: Hillary emails

site or domain:

Here are my results:


My first assumption was that Google simply had not updated for this “new” content. Happens. Unexpected for an important site, but mistakes do happen.

So I skipped back to search for:

(March 05, 2016) | FBI file Langston Hughes | via F.B. Eyes

Search request:

all these words: Langston Hughes

site or domain:

My results:


The indexed files end with March 05, 2016; files from March 06, 2016 onward were not indexed as of 08 March 2016.

Here’s what the listing at That 1 Archive looked like 08 March 2016:


Google is of course free to choose the frequency of its indexing of any site.

Just a word to the wise: if you have scripted advanced searches, check the frequency of Google indexing updates for sites of interest.

It may not be the indexing updates you would expect. (I would have expected That 1 Archive to be nearly simultaneous with uploading. Apparently not.)

Patent Sickness Spreads [Open Source Projects on Prior Art?]

Tuesday, March 8th, 2016

James Cook reports a new occurrence of patent sickness in “Facebook has an idea for software that detects cool new slang before it goes mainstream.”

The most helpful part of James’ post is the graphic outline of the “process” patented by Facebook:


I sure do hope James has not patented that presentation because it makes the Facebook patent, err, clear.

Quick show of hands on originality?

While researching this post, I ran across Open Source as Prior Art at the Linux Foundation. Are there other public projects that research and post prior art with regard to particular patents?

An armory of weapons for opposing ill-advised patents.

The Facebook patent is 9,280,534, Hauser, et al., March 8, 2016, “Generating a social glossary”:

Its abstract:

Particular embodiments determine that a textual term is not associated with a known meaning. The textual term may be related to one or more users of the social-networking system. A determination is made as to whether the textual term should be added to a glossary. If so, then the textual term is added to the glossary. Information related to one or more textual terms in the glossary is provided to enhance auto-correction, provide predictive text input suggestions, or augment social graph data. Particular embodiments discover new textual terms by mining information, wherein the information was received from one or more users of the social-networking system, was generated for one or more users of the social-networking system, is marked as being associated with one or more users of the social-networking system, or includes an identifier for each of one or more users of the social-networking system. (emphasis in original)

U.S. Patents Requirements: Novel/Non-Obvious or Patent Fee?

Monday, February 22nd, 2016

IBM brags about its ranking in patents granted, IBM First in Patents for 23rd Consecutive Year, and is particularly proud of patent 9087304, saying:

We’ve all been served up search results we weren’t sure about, whether they were for “the best tacos in town” or “how to tell if your dog has eaten chocolate.” With IBM Patent no. 9087304, you no longer have to second-guess the answers you’re given. This new tech helps cognitive machines find the best potential answers to your questions by thinking critically about the trustworthiness and accuracy of each source. Simply put, these machines can use their own judgment to separate the right information from wrong. (From:

Did you notice that the first-for-23-years post did not have a single link to any of the patents mentioned?

You would think IBM would be proud enough to link to its new patents and especially 9087304, that “…separate[s] right information from wrong.”

But if you follow the link for 9087304, you get an impression of one reason IBM didn’t include the link.

The abstract for 9087304 reads:

Method, computer program product, and system to perform an operation for a deep question answering system. The operation begins by computing a concept score for a first concept in a first case received by the deep question answering system, the concept score being based on a machine learning concept model for the first concept. The operation then excludes the first concept from consideration when analyzing a candidate answer and an item of supporting evidence to generate a response to the first case upon determining that the concept score does not exceed a predefined concept minimum weight threshold. The operation then increases a weight applied to the first concept when analyzing the candidate answer and the item of supporting evidence to generate the response to the first case when the concept score exceeds a predefined maximum weight threshold.

I will spare you further recitations from the patent.

Show of hands, do U.S. Patents always require:

  1. novel/non-obvious ideas
  2. patent fee
  3. #2 but not #1


Judge rankings by # of patents granted accordingly.

How to find breaking news on Twitter

Friday, February 19th, 2016

How to find breaking news on Twitter by Ruben Bouwmeester, Julia Bayer, and Alastair Reid.

From the post:

By its very nature, breaking news happens unexpectedly. Simply waiting for something to start trending on Twitter is not an option for journalists – you’ll have to actively seek it out.

The most important rule is to switch perspectives with the eyewitness and ask yourself, “What would I tweet if I were an eyewitness to an accident or disaster?”

To find breaking news on Twitter you have to think like a person who’s experiencing something out of the ordinary. Eyewitnesses tend to share what they see unfiltered and directly on social media, usually by expressing their first impressions and feelings. Eyewitness media can include very raw language that reflects the shock felt as a result of the situation. These posts often include misspellings.

In this article, we’ll outline some search terms you can use in order to find breaking news. The list is not intended as exhaustive, but a starting point on which to build and refine searches on Twitter to find the latest information.

A great collection of starter search terms, though they will vary depending on your domain of “breaking” news.

Good illustration of use of Twitter search operators.
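As a sketch of how such searches can be composed, here is a small hypothetical helper (the function and its parameter names are mine, not from the post) that strings together standard Twitter search operators:

```python
def breaking_news_query(keywords, place=None, radius="10mi",
                        lang=None, exclude_retweets=True):
    """Compose a Twitter search query from eyewitness-style keywords
    plus standard search operators (near:/within:, lang:, -filter:)."""
    # Quote multi-word terms so they match as phrases.
    quoted = [f'"{k}"' if " " in k else k for k in keywords]
    parts = ["(" + " OR ".join(quoted) + ")"]
    if place:
        parts.append(f'near:"{place}" within:{radius}')
    if lang:
        parts.append(f"lang:{lang}")
    if exclude_retweets:
        parts.append("-filter:retweets")
    return " ".join(parts)

# Eyewitness-style terms scoped to a city:
query = breaking_news_query(["explosion", "omg", "what happened"],
                            place="Brussels", lang="en")
```

The resulting string — (explosion OR omg OR "what happened") near:"Brussels" within:10mi lang:en -filter:retweets — can be pasted straight into Twitter's search box.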

Other collections of Twitter search terms?

Does Not Advertise With Google? (Rigging Search Results)

Monday, February 8th, 2016

I ask because when I search on Google with the string:

honest society member

I get 82,100,000 “hits,” and the first page is entirely honor society stuff.

No, “did you mean,” or “displaying results for…”, etc.

Not a one.

The top of the second page of results did have a webpage that mentions the site, but not their home page.

I can’t recall seeing an Honestsociety ad with Google and thought perhaps one of you might.

Lacking such ads, my seat of the pants explanation for “honest society member” returning the non-responsive “honor society” listing isn’t very generous.

What anomalies have you observed in Google (or other) search results?

What searches would you use to test ranking in search results by advertiser with Google versus non-advertiser with Google?

Rigging Searches

For my part, it isn’t a question of whether search results are rigged or not, but rather are they rigged the way I or my client prefers?

Or to say it in a positive way: All searches are rigged. If you think otherwise, you haven’t thought very deeply about the problem.

Take library searches for example. Do you think they are “fair” in some sense of the word?

Hmmm, would you agree that the collection practices of a library will give a user an impression of the literature on a subject?

So the search itself isn’t “rigged,” but the data underlying the results certainly influences the outcome.

If you let me pick the data, I can guarantee whatever search result you want to present. Ditto for the search algorithms.

The best we can do is make our choices with regard to the data and algorithms explicit, so that others accept our “rigged” data or choose to “rig” it differently.
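As a toy illustration (entirely hypothetical scoring, not how Google or any real engine works), one extra scoring knob over the same documents and the same query produces whichever winner you like:

```python
def rank(docs, query, advertisers=frozenset(), ad_boost=0.0):
    """Rank docs by naive term frequency, plus an optional boost
    for ids in `advertisers` -- the 'rigging' knob."""
    terms = query.lower().split()
    def score(d):
        tokens = d["text"].lower().split()
        tf = sum(tokens.count(t) for t in terms)
        return tf + (ad_boost if d["id"] in advertisers else 0.0)
    return [d["id"] for d in sorted(docs, key=score, reverse=True)]

# Hypothetical two-document corpus for the demo:
docs = [
    {"id": "honestsociety", "text": "honest society member directory"},
    {"id": "honorsociety", "text": "honor society membership benefits"},
]

# Same data, same query -- the knob decides who "wins":
neutral = rank(docs, "honest society member")
rigged = rank(docs, "honest society member",
              advertisers={"honorsociety"}, ad_boost=5.0)
```

`neutral` puts the exact-match document first; `rigged` reverses the order, with nothing in the query or the data having changed.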

Reverse Image Search (TinEye) [Clue to a User Topic Map Interface?]

Wednesday, February 3rd, 2016

TinEye was mentioned in a post I wrote in 2015, Baltimore Burning and Verification, but I did not follow up at the time.

Unlike some US intelligence agencies, TinEye has a cool logo:


Free registration enables you to share search results with others, an important feature for news teams.

I only tested the plugin for Chrome, but it offers useful result options:


Once installed, hover over an image in your browser, right-click, and select “Search image on TinEye.” Your results are presented as configured under options.

Clue to User Topic Map Interface

That is a good example of how one version of a topic map interface should work. Select some text, right-click, and choose “Search topic map …” (preset or selection), with a configurable result display.

That puts you into interaction with the topic map, which can offer properties to enable you to refine the identification of a subject of interest and then a merged presentation of the results.

As with a topic map, all sorts of complicated things are happening in the background with the TinEye extension.

But as a user, I’m interested in the results that TinEye presents, not how it got them.

I used to say “more interested,” to suggest I might care how useful results came to be assembled. That’s a pretension, and it isn’t true.

It might be true in some particular case, but for the vast majority of searches, I just want the (uncensored Google) results.

US Intelligence Community Logo for Same Capability

I discovered the most likely intelligence community logo for a similar search program:


The answer to the age-old question of “who watches the watchers?” is us. Which watchers are you watching?

A Comprehensive Guide to Google Search Operators

Sunday, January 24th, 2016

A Comprehensive Guide to Google Search Operators by Marcela De Vivo.

From the post:

Google is, beyond question, the most utilized and highest performing search engine on the web. However, most of the users who utilize Google do not maximize their potential for getting the most accurate results from their searches.

By using Google Search Operators, you can find exactly what you are looking for quickly and effectively just by changing what you input into the search bar.

If you are searching for something simple on Google like [Funny cats] or [Francis Ford Coppola Movies] there is no need to use search operators. Google will return the results you are looking for effectively no matter how you input the words.

Note: Throughout this article whatever is in between these brackets [ ] is what is being typed into Google.

When [Francis Ford Coppola Movies] is typed into Google, Google reads the query as Francis AND Ford AND Coppola AND Movies. So Google will return pages that have all those words in them, with the most relevant pages appearing first. Which is fine when you’re searching for very broad things, but what if you’re trying to find something specific?

What happens when you’re trying to find a report on the revenue and statistics from the United States National Park System in 1995 from a reliable source, and not using Wikipedia?

I can’t say that Marcela’s guide is comprehensive for Google in 2016, because I am guessing the post was written in 2013. Hard to say if early or late 2013 without more research than I am willing to donate. Dating posts makes it easy for readers to spot current or past-use-date information.

For the information that is present, this is a great presentation and list of operators.

One way to use this post is to work through every example but use terms from your domain.
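For example, here is a small hypothetical query-builder (the helper and its parameter names are mine, not from the guide), which makes swapping in terms from your own domain mechanical:

```python
def google_query(terms, phrase=None, site=None, filetype=None,
                 exclude_sites=(), intitle=None):
    """Compose a Google query string from common search operators."""
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')          # exact-phrase match
    if intitle:
        parts.append(f"intitle:{intitle}")   # word must appear in the title
    if site:
        parts.append(f"site:{site}")         # restrict to one site/domain
    if filetype:
        parts.append(f"filetype:{filetype}") # restrict to a file format
    parts.extend(f"-site:{s}" for s in exclude_sites)
    return " ".join(parts)

# The post's National Park example, minus Wikipedia:
q = google_query(["revenue", "statistics", "1995"],
                 phrase="National Park System",
                 filetype="pdf",
                 exclude_sites=["wikipedia.org"])
```

Here `q` comes out as revenue statistics 1995 "National Park System" filetype:pdf -site:wikipedia.org, ready for the search bar.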

If you are mining the web for news reporting, compete against yourself on successive stories or within a small group.

Great resource for creating a search worksheet for classes.

Searching for Geolocated Posts On YouTube

Sunday, January 3rd, 2016

Searching for Geolocated Posts On YouTube (video) by First Draft News.

Easily the most information-filled 1 minute and 18 seconds of the holiday season!

Illustrates searching for geolocated posts to YouTube, despite YouTube not offering that option!

New tool in development may help!


Both the video and site are worth a visit!

Don’t forget to check out First Draft News as well!

Previously Unknown Hip Replacement Side Effect: Web Crawler Writing In Python

Saturday, December 12th, 2015

Crawling the web with Python 3.x by Doug Mahugh.

From the post:

These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages to accomplish tasks ranging from searches to archiving content to statistical analyses and so on. Web crawling is a task that has been automated by developers in every programming language around, many times — for example, a search for web crawling source code yields well over a million hits.

So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)

But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling code is Scrapy, which is only available for Python 2. I found a few Python 3 samples, but they all seemed to be either too trivial (not avoiding re-scanning the same page, for example) or too needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.

In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well.
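In the same spirit, a minimal Python 3 crawler that avoids re-scanning pages needs little more than the standard library and a `seen` set. This is my sketch of the general shape, not Doug's code:

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect absolute, fragment-free links from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute = urljoin(self.base_url, value)
                    self.links.append(urldefrag(absolute)[0])

def crawl(start_url, max_pages=10):
    """Breadth-first crawl; the `seen` set prevents re-scanning any page."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-text page; skip it
        parser = LinkParser(url)
        parser.feed(html)
        queue.extend(link for link in parser.links
                     if link.startswith("http") and link not in seen)
    return seen
```

A polite crawler would also respect robots.txt and rate-limit its requests, which this sketch omits; Doug's GitHub repository is the place to look for his full version.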

Doug has been writing publicly about his hip replacement surgery so I don’t think this has any privacy issues. 😉

I am interested to see what he writes once he is fully recovered.

My contacts at the American Medical Association disavow any knowledge of hip replacement surgery driving patients to write in Python and/or to write web crawlers.

I suppose there could be liability implications, especially for C/C++ programmers who lose their programming skills except for Python following such surgery.

Still, glad to hear Doug has been making great progress and hope that it continues!

arXiv Sanity Preserver

Sunday, November 29th, 2015

arXiv Sanity Preserver by Andrej Karpathy.

From the webpage:

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It’s super hacky and was written in 4 hours. I’ll keep polishing it a bit over time perhaps but it serves its purpose for me already. The code uses Arxiv API to download the most recent papers (as many as you want – I used the last 1100 papers over last 3 months), and then downloads all papers, extracts text, creates tfidf vectors for each paper, and lastly is a flask interface for searching through and filtering similar papers using the vectors.

Main functionality is a search feature, and most useful is that you can click “sort by tfidf similarity to this”, which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.


You can see this rather remarkable tool online at:

Beyond its obvious utility for researchers, this could be used as a framework for experimenting with other similarity measures.
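The tf-idf-over-bigrams similarity the webapp relies on can be sketched in plain Python (a simplified stand-in for the actual code, assuming the usual tf·idf weighting and cosine similarity):

```python
import math
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return [" ".join(pair) for pair in zip(toks, toks[1:])]

def tfidf_vectors(docs):
    """One sparse tf-idf vector (a dict) per document, over word bigrams."""
    tfs = [Counter(bigrams(d)) for d in docs]
    df = Counter()                       # document frequency per bigram
    for tf in tfs:
        df.update(tf.keys())
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Sorting every paper by `cosine` against a chosen one is essentially the “sort by tfidf similarity to this” button; swapping in another similarity measure means replacing only `cosine`.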


I first saw this in a tweet by Lynn Cherny.