Archive for the ‘Searching’ Category

Haystack: The Search Relevance Conference! (Proposals by Jan. 19, 2018) Updated

Friday, December 8th, 2017

Haystack: The Search Relevance Conference!

From the webpage:

Haystack is the conference for improving search relevance. If you’re like us, you work to understand the shiny new tools or dense academic papers out there that promise the moon. Then you puzzle how to apply those insights to your search problem, in your search stack. But the path isn’t always easy, and the promised gains don’t always materialize.

Haystack is the no-holds-barred conference for organizations where search, matching, and relevance really matters to the bottom line. For search managers, developers & data scientists finding ways to innovate, see past the silver bullets, and share what actually has worked well for their unique problems. Please come share and learn!

… (inline form for proposal submission)

Welcome topics include

  • Information Retrieval
  • Learning to Rank
  • Query Understanding
  • Semantic Search
  • Applying NLP to search
  • Personalized Search
  • Search UX Strategy: Perceived relevance, smart snippeting
  • Measuring and testing search against business objectives
  • Nuts & bolts: plugins for Solr, Elasticsearch, Vespa, etc
  • Adjacent topics: recommendation systems, entity matching via search, and other topics

… (emphasis in original)

The first link for the conference I saw was http://mailchi.mp/e609fba68dc6/announcing-haystack-the-search-relevance-conference, which promised topics including:

  • Intent detection

The modest price of $75 covers our costs….

To see a solution to the problem of other minds and to discover their intent, all for $75, is quite a bargain. Especially since the $75 covers breakfast and lunch both days, plus dinner the first day in a beer hall. 😉

Even without solving philosophical problems, sponsorship by OpenSource Connections is enough to recommend this conference without reservation.

My expectation is that this conference is going to rock for hard-core search geeks!

PS: Ask if videos will be posted. Thanks!

When Good Enough—Isn’t [Search Engine vs. Librarian Challenge]

Wednesday, December 6th, 2017

When Good Enough—Isn’t by Patti Brennan.

From the post:

Why do we need librarians when we have Google?

What is the role of a librarian now that we can google anything?

How often have you heard that?

Let’s face it: We have all become enticed by the immediacy of the answers that search engines provide, and we’ve come to accept the good-enough answer—even when good enough isn’t.

When I ask a librarian for help, I am tapping not only into his or her expertise, but also into that of countless others behind the scenes.

From the staff who purposefully and thoughtfully develop the collection—guided by a collection development manual other librarians have carefully crafted and considered—to the team of catalogers and indexers who assign metadata to the items we acquire, to the technical staff who design the systems that make automated search possible, we’ve got a small army of librarians supporting my personal act of discovery…and yours.
… (emphasis in original)

A great read to pass along to search fans in your office!

The image of tapping into the wisdom of countless others (dare I say the “crowd?”) behind every librarian is an apt one.

With search engines, you are limited to your expertise and yours alone. No backdrop of indexers, catalogers, metadata experts, to say nothing of those contributing to all those areas.

Compared to a librarian, you are outclassed and overmatched, badly.

Are you ready to take Brennan’s challenge:

Let me offer a challenge: The next time you have a substantive question, ask a librarian and then report back here about how it went.

Ping me if you take Brennan up on that challenge. We would all like to benefit from your experience.

PS: Topic maps can build a backdrop of staff wisdom for you or you can wing every decision anew. Which one do you think works better?

Onion Deep Web Link Directory

Tuesday, November 28th, 2017

Onion Deep Web Link Directory (http://4bcdydpud5jbn33s.onion/)

Without a .onion address in hand, you will need to consult a .onion link list.

This .onion link list offers:

  • Hidden Service Lists and search engines – 23 links
  • Marketplace financial and drugs – 25 links
  • Hosting – 6 links
  • Blogs – 18 links
  • Forums and Chans – 12 links
  • Email and Messaging – 8 links
  • Political – 11 links
  • Hacking – 4 links
  • Warez – 12 links
  • Erotic 18+ – 7 links
  • Non-English – 18 links

Not an overwhelming number of links but enough to keep you and a Tor browser busy over the coming holiday season.

FYI, adult/erotic content sites are a primary means for the distribution of malware.

Hostile entity rules of engagement apply at all .onion addresses. (Just as no one “knows” you are a dog on the Internet, an entity found at a .onion address could be the FBI. Act accordingly.)

I first saw this in the Hunchly Daily Hidden Services Report for 2017-11-03.

Smart HTML Form Trick

Monday, October 30th, 2017

An HTML form trick to add some convenience to life by Bob DuCharme.

From the post:

On the computers that I use the most, the browser home page is an HTML file with links to my favorite pages and a “single” form that lets me search the sites that I search the most. I can enter a search term in the field for any of the sites, press Enter, and then that site gets searched. The two tricks that I use to create these fields have been handy enough that I thought I’d share them in case they’re useful to others.

I quote the word “single” above because it appears to be a single form but is actually multiple little forms in the HTML. Here is an example with four of my entries; enter something into any of the fields and press Enter to see what I mean:

As always, an immediately useful tip from DuCharme!

The multiple search boxes reminded me of the early metasearch engines that combined results from multiple search engines.

It will vary by topic, but what resources would you search across day to day?

When You Say “Google,” You Mean #GCensor

Tuesday, August 8th, 2017

Google Blocking Key Search Terms For Left Websites by Andre Damon.

From the post:

Note: In a previous article we reported that Popular Resistance had also seen more than a 60% drop in visits to our website since April when Google changed its search functions. This report goes further into how Google is blocking key search terms. See Google’s New Search Protocol Restricting Access To Leading Leftist Web Sites. KZ

Google blocked every one of the WSWS’s 45 top search terms

An intensive review of Internet data has established that Google has severed links between the World Socialist Web Site and the 45 most popular search terms that previously directed readers to the WSWS. The physical censorship implemented by Google is so extensive that of the top 150 search terms that, as late as April 2017, connected the WSWS with readers, 145 no longer do so.

These findings make clear that the decline in Google search traffic to the WSWS is not the result of some technical issue, but a deliberate policy of censorship. The fall took place in the three months since Google announced on April 25 plans to promote “authoritative web sites” above those containing “offensive” content and “conspiracy theories.”

Because of these measures, the WSWS’s search traffic from Google has fallen by two-thirds since April.

The WSWS has analyzed tens of thousands of search terms, and identified those key phrases and words that had been most likely to place the WSWS on the first or second page of search results. The top 45 search terms previously included “socialism,” “Russian revolution,” “Flint Michigan,” “proletariat,” and “UAW [United Auto Workers].” The top 150 results included the terms “UAW contract,” “rendition” and “Bolshevik revolution.” All of these terms are now blocked.
… (emphasis in original)

In addition to censoring “hate speech” and efforts such as Google Says It Will Do More to Suppress Terrorist Propaganda, there is now evidence that Google is tampering with search results for websites simply because they are left-wing.

Promote awareness of the censorship by Google, Facebook and Twitter, by using #GCensor, #FCensor, and #TCensor, respectively, for them.

I don’t expect to change the censorship behavior of #GCensor, #FCensor, and #TCensor. The remedy is non-censored alternatives.

All three have proven themselves untrustworthy guardians of free speech.

DMCA Complaint As Finding Aid

Thursday, August 3rd, 2017

Credit where credit is due, I saw this idea in How to Get Past DMCA Take-Downs in Google Search and report it here, sans the video.

The gist of the idea is that DMCA complaints, found at Lumen, specify (in the case of search engines) the links that should not be displayed to users.

In a Google search result, content subject to a DMCA complaint will appear as:

In response to multiple complaints we received under the US Digital Millennium Copyright Act, we have removed 2 results from this page. If you wish, you may read the DMCA complaints that caused the removals at LumenDatabase.org: Complaint, Complaint.

If you follow the complaint links, knowing Google is tracking your following of those links, the complaints list the URLs to be removed from search results.

You can use the listed URLs to verify the presence of illegal content, compile lists of sites with such content, etc.

Enjoy!

PS: I’m adding their RSS feed of new notices. You should too.
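If you do the same, a few lines of Python will skim the newest notices for you. A minimal sketch using feedparser; the feed address below is a placeholder, so substitute whatever URL Lumen actually publishes for its new-notices feed:

import feedparser

# Placeholder address -- use the URL of Lumen's "new notices" RSS feed.
feed = feedparser.parse("https://lumendatabase.org/notices/feed")

for entry in feed.entries[:20]:
    print(entry.title)
    print("   ", entry.link)

From there it is one more click (and one more entry in Google’s log of your clicks) to the complaint and its list of removed URLs.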

The Cartoon Bank

Friday, May 12th, 2017

The Cartoon Bank by the Condé Nast Collection.

While searching for a cartoon depicting Sean Spicer at a White House news briefing, I encountered The Cartoon Bank.

A great source of instantly recognized cartoons but I’m still searching for one I remember from decades ago. 😉

Text Mining For Lawyers (The 55% Google Weaned Lawyers Are Missing)

Thursday, May 4th, 2017

Working the Mines: How Text Mining Can Help Create Value for Lawyers by Rees Morrison, Juris Datoris, Legaltech News.

From the post:

To most lawyers, text mining may sound like a magic wand or more hype regarding “artificial intelligence.” In fact, with the right input, text mining is a well-grounded genre of software that can find patterns and insights from large amounts of written material. So, if your law firm or law department has a sizable amount of text from various sources, it can extract value from that collection through powerful software tools.

To help lawyers recognize the potential of text mining and demystify it, this article digs through typical steps of a project. Terms of art related to this domain of software are in bold and, yes, there will be a quiz at the end.

Our example project assumes that your law firm (or law department) has gathered a raft of written comments through an internal survey of lawyers or from clients who have typed their views in a client satisfaction survey (perhaps in response to an open-ended question like “In what ways could we improve?”). All that writing is grist for the mill of text mining!

Great overview of the benefits and complexities of text mining!

I was recently assured by a Google-weaned lawyer that natural language searching enabled him and his friends to do a few quick searches to find relevant authorities.

I could not help but point out my review of Blair and Maron’s work, which demonstrated that while attorneys estimated they had recovered 75% of the relevant documents, in fact they had recovered barely 20%.

No solution returns 100% of the relevant documents for any non-trivial dataset, but leaving 55% on the floor doesn’t inspire confidence.

Especially when searchers consider any relevant result to be success. Whether it is depends, among other things, on how many relevant authorities existed and whether any were closer to your facts than the ones found.

Is any relevant result your test for research success, or is it the best relevant result, with a measure of confidence in its quality?

RegexBuddy (Think Occur Mode for Emacs)

Saturday, March 18th, 2017

RegexBuddy

From the webpage:

RegexBuddy is your perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on actual data. Debug without guesswork by stepping through the actual matching process. Use the regex with source code snippets automatically adjusted to the particulars of your programming language. Collect and document libraries of regular expressions for future reuse. GREP (search-and-replace) through files and folders. Integrate RegexBuddy with your favorite searching and editing tools for instant access.

Learn all there is to know about regular expressions from RegexBuddy’s comprehensive documentation and regular expression tutorial.

I was reminded of RegexBuddy when I stumbled on the RegexBuddy Manual in a search result.

The XQuery/XPath regex treatment is far briefer than I would like but at 500+ pages, it’s an impressive bit of work. Even without a copy of RegexBuddy, working through the examples will make you a regex terrorist.

The only unfortunate aspect, for *nix users, is that you need to run RegexBuddy in a Windows VM. 🙁

If you are comfortable with Emacs, Windows or otherwise, then the Occur mode comes to mind. It doesn’t have the visuals of RegexBuddy but then you are accustomed to a power-user environment.

In terms of productivity, it’s hard to beat regexes. I passed along a one-liner awk regex tip today to extract content from a “…pile of nonstandard multiply redundant JavaScript infested pseudo html.”

I’ve seen the HTML in question. The description seems a bit generous to me. 😉
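I won’t reproduce the one-liner here, but the same idea in Python looks roughly like this. A sketch only: the tag, class name and file name are all made up for illustration:

import re

# Hypothetical: pull the text of every <td class="price"> cell out of
# messy, non-standard HTML that keeps choking a real parser.
pattern = re.compile(r'<td class="price"[^>]*>(.*?)</td>',
                     re.IGNORECASE | re.DOTALL)

with open("pseudo.html", encoding="utf-8", errors="replace") as f:
    html = f.read()

for match in pattern.finditer(html):
    # Strip any tags nested inside the cell before printing.
    print(re.sub(r"<[^>]+>", "", match.group(1)).strip())

Crude? Yes. But for one-off extraction from hopeless markup, crude and done beats elegant and unfinished.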

Try your hand at regexes and see if your productivity increases!

Can You Replicate Your Searches?

Thursday, February 16th, 2017

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Melissa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed boost.sh.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed boost.sh,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /rt.sh” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse events.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.
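Some of those transcription errors are mechanical enough to catch with a script before you ever run the strategy. A rough sketch (mine, not Rethlefsen’s) that flags field qualifiers chained with periods and spelled-out subheadings trailed by a stray .sh:

import re

# Two of the error patterns described in the comment above:
# 1) field qualifiers chained with periods, e.g. ".sh.kw.ti.ab"
#    (Ovid expects commas: ".sh,kw,ti,ab")
# 2) a spelled-out subheading followed by a stray ".sh",
#    e.g. "/adverse events.sh"
period_fields = re.compile(r"\.(sh|kw|ti|ab)(\.(sh|kw|ti|ab))+\b")
stray_sh = re.compile(r"/[A-Za-z ]+\.sh\b")

strategy = [
    "tumour bed boost.sh.kw.ti.ab",               # line 39 as published
    "Radiotherapy, Conformal/adverse events.sh",  # line 43 as published
    "Breast Neoplasms/rt",                        # corrected form, passes
]

for number, line in enumerate(strategy, 1):
    for name, pattern in (("period-separated fields", period_fields),
                          ("spelled-out subheading with .sh", stray_sh)):
        if pattern.search(line):
            print(f"entry {number}: possible transcription error ({name}): {line}")

It would not catch everything Rethlefsen describes, but it is the sort of check a journal could run over submitted search strategies automatically.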

We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Tuesday, February 14th, 2017

We’re Bringing Learning to Rank to Elasticsearch.

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

  1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
  2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  3. Train a model that can accurately map features to a relevance score
  4. Deploy the model to your search infrastructure, using it to rank search results in production

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)
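To make step 1 above concrete, here is a sketch of the kind of judgment list it produces, in the RankLib/SVMrank-style format commonly used for training ranking models. The grades, feature numbers and documents are all invented:

# A toy judgment list: grade, query id, feature:value pairs, comment.
# 4 = exactly relevant ... 0 = not relevant. Everything below is made up.
judgments = """\
4 qid:1 1:12.3 2:0.9 3:1 # doc_101  query: rain boots
2 qid:1 1:4.7  2:0.4 3:0 # doc_207
0 qid:1 1:0.2  2:0.1 3:0 # doc_530
3 qid:2 1:9.8  2:0.7 3:1 # doc_101  query: wellies
"""

for line in judgments.splitlines():
    data, _, comment = line.partition("#")
    grade, qid, *features = data.split()
    feats = dict(f.split(":") for f in features)
    print(f"{qid}: grade {grade}, features {feats}, {comment.strip()}")

Step 3 then hands files like this to a learner (RankLib, XGBoost, etc.) and step 4 pushes the resulting model into the search engine, which is exactly the hand-off the plugin is built for.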

A great post as always but of particular interest for topic map fans is this passage:


Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

Google Helps Spread Fake News [Fake News & Ad Revenue – Testing]

Saturday, December 10th, 2016

Google changed its search algorithm and that made it more vulnerable to the spread of fake news by Hannah Roberts.

From the post:

Google’s search algorithm has been changed over the last year to increasingly reward search results based on how likely you are to click on them, multiple sources tell Business Insider.

As a result, fake news now often outranks accurate reports on higher quality websites.

The problem is so acute that Google’s autocomplete suggestions now actually predict that you are searching for fake news even when you might not be, as Business Insider noted on December 5.

Hannah does a great job of setting forth the evidence and opinions on the algorithm change but best summarizes it when she says:


The changes to the algorithm now move links up Google’s search results page if Google detects that more people are clicking on them, search experts tell Business Insider.

Just in case you don’t know:

more clicks != credible/useful search results

But it is true:

more clicks = more usage/ad revenue

Google and Facebook find “fake news” profitable. Both will make a great show of suppressing outlying “fake news,” but not so much as to impact profits.

There’s a data science “fake news” project:

Track the suppression of “fake news” by Google and Facebook against the performance of their ad revenue.

Hypothesis: When suppression of “fake news” impinges on ad revenue for more than two consecutive hours, dial back on suppression mechanisms. (Ditto for 4, 6, 12 and 24 hour cycles.)

Odds on Google and Facebook being transparent with regard to suppression of “fake news” and ad revenue, so that the results of testing that hypothesis are verifiable?

😉

Egyptological Museum Search

Tuesday, November 22nd, 2016

Egyptological Museum Search

From the post:

The Egyptological museum search is a PHP tool aimed to facilitate locating the descriptions and images of ancient Egyptian objects in online catalogues of major museums. Online catalogues (ranging from selections of highlights to complete digital inventories) are now offered by almost all major museums holding ancient Egyptian items and have become indispensable in research work. Yet the variety of web interfaces and of search rules may overstrain any person performing many searches in different online catalogues.

Egyptological museum search was made to provide a single search point for finding objects by their inventory numbers in major collections of Egyptian antiquities that have online catalogues. It tries to convert user input into search queries recognised by museums’ websites. (Thus, for example, stela Geneva D 50 is searched as “D 0050,” statue Vienna ÄS 5046 is searched as “AE_INV_5046,” and coffin Turin Suppl. 5217 is searched as “S. 05217.”) The following online catalogues are supported:

The search interface uses a short list of aliases for museums.
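The rewriting it performs boils down to a per-museum normalization rule. A hypothetical sketch of the idea in Python, with rules made up from the examples in the quote rather than taken from the tool itself:

import re

def normalize(museum: str, number: str) -> str:
    """Rewrite a human-style inventory number into the form a museum's
    online catalogue expects (rules inferred from the examples above)."""
    if museum == "geneva":                        # "D 50" -> "D 0050"
        prefix, digits = number.split()
        return f"{prefix} {int(digits):04d}"
    if museum == "vienna":                        # "ÄS 5046" -> "AE_INV_5046"
        return "AE_INV_" + re.sub(r"\D", "", number)
    if museum == "turin":                         # "Suppl. 5217" -> "S. 05217"
        digits = re.sub(r"\D", "", number)
        return f"S. {int(digits):05d}"
    return number

print(normalize("geneva", "D 50"))        # D 0050
print(normalize("vienna", "ÄS 5046"))     # AE_INV_5046
print(normalize("turin", "Suppl. 5217"))  # S. 05217

Maintaining rules like these for every supported museum is exactly the sort of tedium volunteers could help with.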

Once you see/use the interface proper (here), I hope you will be interested in volunteering to improve it.

Guide to Making Search Relevance Investments, free ebook

Thursday, October 20th, 2016

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is an interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.

Enjoy!

The Podesta Emails [In Bulk]

Wednesday, October 19th, 2016

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:

podesta-basic-search-460

and, advanced:

podesta-adv-search-460

As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!
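As a rough illustration of the first bullet above: once Michael’s archives are unzipped locally, a short script can tally sender/recipient pairs across the corpus. A sketch only; the directory name is whatever you unzipped to, and it assumes the messages are individual .eml files:

from collections import Counter
from email import policy
from email.parser import BytesParser
from pathlib import Path

pairs = Counter()

# "podesta-emails/" stands in for wherever you unzipped the archives.
for path in Path("podesta-emails").rglob("*.eml"):
    with open(path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    sender = (msg["from"] or "").strip()
    for field in ("to", "cc"):
        for recipient in (msg[field] or "").split(","):
            recipient = recipient.strip()
            if sender and recipient:
                pairs[(sender, recipient)] += 1

for (sender, recipient), count in pairs.most_common(20):
    print(f"{count:5d}  {sender} -> {recipient}")

Feed the same loop into a graph library and you have the sender/recipient network, which no search box at Wikileaks will draw for you.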

Enjoy!

Apache Lucene 6.2.1 and Apache Solr 6.2.1 Available [Presidential Data Leaks]

Thursday, September 22nd, 2016

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.2.1

Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.2.1

If you aren’t using Lucene/Solr 6.2, here’s your chance to grab the latest bug fixes as well!

Data leaks will accelerate as the US presidential election draws to a close.

What’s your favorite tool for analysis and delivery of data dumps?

Enjoy!

NSA: Being Found Beats Searching, Every Time

Tuesday, September 20th, 2016

Equation Group Firewall Operations Catalogue by Mustafa Al-Bassam.

From the post:

This week someone auctioning hacking tools obtained from the NSA-based hacking group “Equation Group” released a dump of around 250 megabytes of “free” files for proof alongside the auction.

The dump contains a set of exploits, implants and tools for hacking firewalls (“Firewall Operations”). This post aims to be a comprehensive list of all the tools contained or referenced in the dump.

Mustafa’s post is a great illustration of why “being found beats searching, every time.”

Think of the cycles you would have to spend to duplicate this list. Multiply that by the number of people interested in this list. Assuming their time is not valueless, do you start to see the value-add of Mustafa’s post?

Mustafa found each of these items in the data dump and then preserved his finding for the use of others.

It’s not a very big step beyond this preservation to the creation of a container for each of these items, enabling the preservation of other material found on them or related to them.

Search is a starting place and not a destination.

Unless you enjoy repeating the same finding process over and over again.

Your call.

Congress.gov Corrects Clinton-Impeachment Search Results

Monday, September 19th, 2016

After posting Congress.gov Search Alert: “…previous total of 261 to the new total of 0.” [Solved] yesterday, pointing out that a change from http:// to https:// altered a search result for Clinton w/in 5 words impeachment, I got an email this morning:

congress-gov-correction-460

I appreciate the update and correction for saved searches, but my point about remote data changing without notice to you remains valid.

I’m still waiting for word on bulk downloads from both Wikileaks and DC Leaks.

Why leak information vital to public discussion and then limit access to search?

Introducing arxiv-sanity

Sunday, September 18th, 2016

Only a small part of arXiv appears at http://www.arxiv-sanity.com/, but it is enough to show the feasibility of this approach.

What captures my interest is the potential to substitute/extend the program to use other similarity measures.
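As I understand it, arxiv-sanity ranks papers using TF-IDF features of the text; any other measure that yields pairwise similarities would slot into the same role. A minimal sketch with scikit-learn and toy abstracts (not arxiv-sanity’s actual code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for paper abstracts; in practice these come from the arXiv API.
abstracts = [
    "We present a convolutional network for image classification.",
    "A recurrent model for sequence to sequence translation.",
    "Deep convolutional networks improve image recognition accuracy.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
similarity = cosine_similarity(vectors)

# Report the most similar other paper for each abstract.
for i, row in enumerate(similarity):
    best = max((j for j in range(len(abstracts)) if j != i), key=lambda j: row[j])
    print(f"paper {i} is closest to paper {best} (score {row[best]:.2f})")

Swap the vectorizer for topic models, citation overlap, or co-author links and the surrounding machinery stays the same.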

Bearing in mind that searching is only the first step towards the acquisition and preservation of knowledge.

PS: I first saw this in a tweet by Data Science Renee.

Congress.gov Search Alert: “…previous total of 261 to the new total of 0.” [Solved]

Sunday, September 18th, 2016

Odd message from the Congress.gov search alert this AM:

congress-alert-460

Here’s the search I created back in June, 2016:

congress-alert-search-460

My (probably inaccurate) recollection is that I was searching for some quote from the impeachment of Bill Clinton and was too lazy to specify a term of congress, hence:

all congresses – searching for Clinton within five words of impeachment

Fairly trivial search that produced 261 “hits.”

I set the search alert more to explore the search options than any expectation of different future results.

Imagine my surprise to find that all congresses – searching for Clinton within five words of impeachment – performed today, results in 0 “hits.”

Suspecting some internal changes to the search interface, I re-entered the search today and got 0 “hits.”

Do you have other saved searches with radically different results as of today?

This is not, repeat not, the result of some elaborate conspiracy to assist Secretary Clinton in her bid for the presidency.

I do think something fundamental has gone wrong with searching at Congress.gov and it needs to be fixed.

This is an illustration of why Wikileaks, DC Leaks and other data sites should provide easy to access downloads in bulk of their materials.

Providing search interfaces to document collections is a public service, but document collections or access to them can change in ways not transparent to search users. Such as demonstrated by the CIA removing documents previously delivered to the Senate.

Petition Wikileaks, DC Leaks and other data sites for easy bulk downloads.

That will ensure the “evidence” will not shift under your feet and the availability of more sophisticated means of analysis than brute-force search.


Update: The change from http:// to https:// by the congress.gov site trashed my saved query, as well as my attempt to re-perform the same search using http://.

Using https:// returns the same 261 search results.

What is your experience with other saved searches at congress.gov?

Five Essential Research Tips for Journalists Using Google

Saturday, July 2nd, 2016

Five Essential Research Tips for Journalists Using Google by Temi Adeoye.

This graphic:

google-search-460

does not appear in Temi’s post but rather in a tweet by the International Center for Journalists (ICFJ) about his post.

See Temi’s post for the details but this graphic is a great reminder.

This will make a nice addition to my local page of search links.

Visual Searching with Google – One Example – Neo4j – Raspberry Pi

Tuesday, April 26th, 2016

Just to show I don’t spend too much time thinking of ways to gnaw on the ankles of Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, here is my experience with visual searching with Google today.

I saw this image on Twitter:

neo4j-cluster

I assumed that cutting the “clutter” from around the cluster might produce a better result. Besides, the plastic separators looked (to me) to be standard and not custom made.

Here is my cropped image for searching:

neo4j-cluster-cropped

Google responded this looks like: “water.” 😉

OK, so I tried cropping it more just to show the ports, thinking that might turn up similar port arrangements, here’s that image:

neo4j-cluster-ports

Google says: “machinery.” With a number of amusing “similar” images.

BTW, when I tried the full image, the first one, Google says: “electronics.”

OK, so much for Google image searching. What if I try a text search?

Searching on neo4j cluster and raspberry pi (the most likely suspect), my first “hit” had this image:

1st-neo4j-hit

Same height as the search image.

My seventh “hit” has this image:

bruggen-cluster

Same height and logo as the search image. That’s Stefan Armbruster next to the cluster. (He does presentations on building the cluster, but I have yet to find a video of one of those presentations.)

My eighth “hit” has this image:

neo4j-8th

Same wiring color (networking cable) and height as the search image.

Definitely Raspberry Pi but I wasn’t able to uncover further details.

Very interested in seeing a video of Stefan putting one of these together!

UC Davis Spent $175,000.00 To Suppress This Image (let’s disappoint them)

Saturday, April 16th, 2016

uc-davis-pic

If you have a few minutes, could you repost this image to your blog and/or Facebook page?

Some references you may want to cite include:

Pepper-sprayed students outraged as UC Davis tried to scrub incident from web by Anita Chabria

Calls for UC Davis chancellor’s ouster grow amid Internet scrubbing controversy by Sarah Parvini and Ruben Vives.

UC Davis Chancellor Faces Calls To Resign Over Pepper Spray Incident (NPR)

Katehi’s effort to alter search engine results backfires spectacularly

UC Davis’ damage control: Dumb-de-dumb-dumb

Reposting the image and links to the posts cited above will help disappoint the mislaid plans to suppress it.

What is more amazing than the chancellor thinking information on the Internet can be suppressed, at least for a paltry $175K, is that this pattern will be repeated year after year.

Lying about information known to others is a losing strategy, always.

But that strategy will be picked up by other universities, governments and their agencies, corporations, to say nothing of individuals.

Had UC Davis spent that $175K on better training for its police officers, people would still talk about this event, but it would be in contrast to the new and improved way UC Davis deals with protesters.

That’s not likely to happen now.

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:

duplicate-v-coreference-500-clipped

If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?
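If you want to reproduce something like it on your own numbers, the matplotlib-venn package draws the two-set version directly. A sketch; the overlap below is a placeholder, so substitute the real intersection of your two result sets:

from matplotlib import pyplot as plt
from matplotlib_venn import venn2

# Hit counts from the searches above; the overlap of 500 is a made-up
# placeholder -- use the actual number of articles returned by both queries.
overlap = 500
venn2(
    subsets=(3854 - overlap, 3290 - overlap, overlap),
    set_labels=('"duplicate detection"', '"coreference resolution"'),
)
plt.title("Articles missed by searching on only one term")
plt.savefig("duplicate-v-coreference.png", dpi=150)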

It is a visualization that could be performed on a client’s data, using their own search engine/database, in order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlap in terminology between two vendors, one for the United States Army and one for the Air Force? That is, no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capability, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Any interest in working this up into a client prospecting/topic map marketing tool?


Separately, I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram tool for generating publication-quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Friday, April 8th, 2016

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.0.0
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.0.0

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored=”false” to preserve sort order on multi-valued fields that have both stored=”true” and docValues=”true”.
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-‘. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)

Happy searching!

Serious Non-Transparency (+ work around)

Tuesday, March 29th, 2016

I mentioned http://www.bkstr.com/ yesterday in my post: Courses -> Texts: A Hidden Relationship, where I lamented the inability to find courses by their titles.

So you could easily discover the required/suggested texts for any given course. Like browsing a physical campus bookstore.

Obscurity is an “information smell” (to build upon Felienne‘s expansion of code smell to spreadsheets).

In this particular case, the “information smell” is skunk class.

I revisited http://www.bkstr.com/ today to extract its > 1200 bookstores for use in crawling a sample of those sites.

For ugly HTML, view the source of: http://www.bkstr.com/.

Parsing that is going to take time and surely there is an easier way to get a sample of the sites for mining.

The idea didn’t occur to me immediately but I noticed yesterday that the general form of web addresses was:

bookstore-prefix.bkstr.com

So, after some flailing about with the HTML from bkstr.com, I searched for “bkstr.com” and requested all the results.
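Once the result pages (or the bkstr.com source itself) are saved locally, pulling out the store prefixes is a one-regex job. A rough sketch:

import re
from pathlib import Path

stores = set()

# "saved-pages/" holds whatever HTML you saved: search result pages,
# or the bkstr.com source itself.
for path in Path("saved-pages").glob("*.html"):
    text = path.read_text(encoding="utf-8", errors="replace")
    stores.update(re.findall(r"https?://([\w-]+)\.bkstr\.com", text))

for store in sorted(stores):
    print(f"http://{store}.bkstr.com")

That gives a tidy list to sample from.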

I’m picking a random ten bookstores with law books for further searching.

Not a high priority but I am curious what lies behind the smoke, mirrors, complex HTML and poor interfaces.

Maybe something, maybe nothing. Won’t know unless we look.

PS: Perhaps a better query string:

www.bkstr.com textbooks-and-course-materials

Suggested refinements?

Bias For Sale: How Much and What Direction Do You Want?

Tuesday, March 29th, 2016

Epstein and Robertson pitch it a little differently but that is the bottom line of: The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections.

Abstract:

Internet search rankings have a significant impact on consumer choices, mainly because users trust and choose higher-ranked results more than lower-ranked results. Given the apparent power of search rankings, we asked whether they could be manipulated to alter the preferences of undecided voters in democratic elections. Here we report the results of five relevant double-blind, randomized controlled experiments, using a total of 4,556 undecided voters representing diverse demographic characteristics of the voting populations of the United States and India. The fifth experiment is especially notable in that it was conducted with eligible voters throughout India in the midst of India’s 2014 Lok Sabha elections just before the final votes were cast. The results of these experiments demonstrate that (i) biased search rankings can shift the voting preferences of undecided voters by 20% or more, (ii) the shift can be much higher in some demographic groups, and (iii) search ranking bias can be masked so that people show no awareness of the manipulation. We call this type of influence, which might be applicable to a variety of attitudes and beliefs, the search engine manipulation effect. Given that many elections are won by small margins, our results suggest that a search engine company has the power to influence the results of a substantial number of elections with impunity. The impact of such manipulations would be especially large in countries dominated by a single search engine company.

I’m not surprised by SEME (search engine manipulation effect).

Although I would probably be more neutral and say: Search Engine Impact on Voting.

Whether you consider one result or another as the result of “manipulation” is a matter of perspective. No search engine strives to deliver “false” information to users.

Gary Anthes in Search Engine Agendas, Communications of the ACM, Vol. 59 No. 4, pages 19-21, writes:

In the novel 1984, George Orwell imagines a society in which powerful but hidden forces subtly shape peoples’ perceptions of the truth. By changing words, the emphases put on them, and their presentation, the state is able to alter citizens’ beliefs and behaviors in ways of which they are unaware.

Now imagine today’s Internet search engines did just that kind of thing—that subtle biases in search engine results, introduced deliberately or accidentally, could tip elections unfairly toward one candidate or another, all without the knowledge of voters.

That may seem an unlikely scenario, but recent research suggests it is quite possible. Robert Epstein and Ronald E. Robertson, researchers at the American Institute for Behavioral Research and Technology, conducted experiments that showed the sequence of results from politically oriented search queries can affect how users vote, especially among undecided voters, and biased rankings of search results usually go undetected by users. The outcomes of close elections could result from the deliberate tweaking of search algorithms by search engine companies, and such manipulation would be extremely difficult to detect, the experiments suggest.

Gary’s post is a good supplement to the original article, covering some of the volunteers who are ready to defend the rest of us from biased search results.

Or as I would put it, to inject their biases into search results as opposed to other biases they perceive as being present.

If you are more comfortable describing the search results you want presented as “fair and equitable,” etc., please do so but I prefer the honesty of naming biases as such.

Or as David Bowie once said:

Make your desired bias, direction, etc., a requirement and allow data scientists to get about the business of conveying it.

Certainly that’s what “ethical” data scientists are doing at Google as they conspire with the US government and others to overthrow governments, play censor to fight “terrorists,” and undertake other questionable activities.

I object to some of Google’s current biases because I would have them be biased in a different direction.

Let’s sell your bias/perspective to users with a close eye on the bright line of the law.

Game?

Courses -> Texts: A Hidden Relationship

Monday, March 28th, 2016

Quite by accident I discovered the relationship between courses and their texts is hidden in many (approx. 2000) campus bookstore interfaces.

If you visit a physical campus bookstore you can browse courses for their textbooks. Very useful if you are interested in the subject but not taking the course.

An online LLM (master’s in taxation) flyer prompted me to check the textbooks for the course work.

A simple enough information request. Find the campus bookstore and browse by course for text listings.

Not so fast!

The online presences of over 1200 campus bookstores are delivered by http://www.bkstr.com/, which offers this interface:

bookstore-campus

Another 748 campus bookstores are delivered by http://bncollege.com/, with a similar interface for textbooks:

harvard-yale

I started this post by saying the relationship between courses and their texts is hidden, but that’s not quite right.

The relationship between a meaningless course number and its required/suggested text is visible, but the identification of a course by a numeric string is hardly meaningful to the casual observer (read: not an enrolled student).

Perhaps better to say that a meaningful identification of courses for non-enrolled students and their relationship to required/suggested texts is absent.

That is, the relationship of course -> text is present, but not in a form meaningful to anyone other than a student in that course.

Considering that two separate vendors across almost 2,000 bookstores deliberately obscure the course -> text relationship, one has to wonder why.

I don’t have any immediate suggestions but when I encounter systematic obscuring of information across vendors, alarm bells start to go off.

Just for completeness sake, you can get around the obscuring of the course -> text relationship by searching for syllabus LLM taxation income OR estate OR corporate or (school name) syllabus LLM taxation income OR estate OR corporate. Extract required/suggested texts from posted syllabi.

PS: If you can offer advice on bookstore interfaces, suggest enabling the browsing of courses by name and linking to the required/suggested texts.


During the searches I made writing this post, I encountered a syllabus on basic tax by Prof. Bret Wells which has this quote by Martin D. Ginsburg:

Basic tax, as everyone knows, is the only genuinely funny subject in law school.

Tax law does have an Alice in Wonderland quality about it, but The Hunting of the Snark: an Agony in Eight Fits is probably the closer match.

#AlphaGo Style Monte Carlo Tree Search In Python

Sunday, March 27th, 2016

Raymond Hettinger (@raymondh) tweeted the following links for anyone who wants an #AlphaGo style Monte Carlo Tree Search in Python:

Introduction to Monte Carlo Tree Search by Jeff Bradberry.

Monte Carlo Tree Search by Cameron Browne.

Jeff’s post is your guide to Monte Carlo Tree Search in Python while Cameron’s site bills itself as:

This site is intended to provide a comprehensive reference point for online MCTS material, to aid researchers in the field.

I didn’t see anything dated later than 2010 on Cameron’s site.

Suggestions for other collections of MCTS material that are more up to date?
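If you want the bare mechanics before working through Jeff’s post, here is a toy UCT sketch of my own. Nim stands in for Go, and the selection, expansion, random playout and backpropagation steps are all there; nothing about it is AlphaGo-grade:

import math
import random

# A deliberately tiny game so the search stays visible: Nim with 10 stones,
# take 1-3 per turn, and the player who takes the last stone wins.
def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones, self.parent, self.move = stones, parent, move
        self.children, self.untried = [], legal_moves(stones)
        self.wins = self.visits = 0

    def uct_child(self, c=1.4):
        # Upper Confidence Bound applied to trees (UCT).
        return max(self.children, key=lambda n: n.wins / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(stones, iterations=10000):
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCT.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one untried move as a new child.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node.
        stones_left, extra_moves = node.stones, 0
        while stones_left > 0:
            stones_left -= random.choice(legal_moves(stones_left))
            extra_moves += 1
        # The player who moved into `node` wins if the playout needs an
        # even number of further moves (the opponent runs out of stones).
        mover_won = (extra_moves % 2 == 0)
        # 4. Backpropagation: flip the winner flag at each level up the tree.
        while node is not None:
            node.visits += 1
            node.wins += mover_won
            mover_won = not mover_won
            node = node.parent
    return max(root.children, key=lambda n: n.visits).move

# Optimal play is to leave a multiple of 4, i.e. take 2 from 10; with enough
# iterations the visit counts should usually point there.
print(mcts(10))

The loop is the whole story: everything AlphaGo adds (policy and value networks guiding selection and replacing random playouts) hangs off these same four steps.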

Searching http://that1archive.neocities.org/ with Google?

Tuesday, March 8th, 2016

Not critical but worth mentioning.

I saw:

(March 08, 2016) | 28,882 Hillary Clinton Emails

today and wanted to make sure it wasn’t a duplicate upload.

What better to do than invoke Google Advanced Search with:

all these words: Hillary emails

site or domain: http://that1archive.neocities.org/

Here are my results:

hillary-email-search

My first assumption was that Google simply had not updated for this “new” content. Happens. Unexpected for an important site like http://that1archive.neocities.org/, but mistakes do happen.

So I skipped back to search for:

(March 05, 2016) | FBI file Langston Hughes | via F.B. Eyes

Search request:

all these words: Langston Hughes

site or domain: http://that1archive.neocities.org/

My results:

langston-search

The indexed files end with March 05, 2016; files from March 06, 2016 onward are not indexed, as of 08 March 2016.

Here’s what the listing at That 1 Archive looked like 08 March 2016:

a1-march-08

Google is of course free to choose the frequency of its indexing of any site.

Just a word to the wise: if you have scripted advanced searches, check the frequency of Google indexing updates for sites of interest.

The indexing updates may not be as frequent as you would expect. (I would have expected That 1 Archive to be indexed nearly simultaneously with uploading. Apparently not.)