Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 28, 2019

How-To Black Box Google’s Algorithm of Oppression

Filed under: Algorithms,Bias,Feminism,Search Algorithms,Search Data,Searching,sexism — Patrick Durusau @ 6:55 pm

Safiya Noble’s Algorithms of Oppression highlights the necessity of asking members of marginalized communities about their experiences with algorithms. I can read the terms that Noble uses in her Google searches and her analysis of the results. What I can’t do, as an older white male, is authentically originate the queries of a Black woman scholar or estimate her reaction to search results.

That inability to assume a role in a marginalized community extends across all marginalized communities and in between them. To understand the impact of oppressive algorithms, such as Google’s search algorithms, we must:

  1. Empower everyone who can use a web browser with the ability to black box Google’s algorithm of oppression, and
  2. Listen to their reports of queries and experiences with results of queries.

Empowering everyone to participate in testing Google’s algorithms avoids relying on second-hand reports about the experiences of marginalized communities. We will be listening to members of those communities.

In its simplest form, your black boxing of Google starts with a Google search box, then:

your search terms site:website OR site:website

That search string states your search terms, followed by an OR list of the websites you want searched. The results are Google’s ranking of your search against the specified websites.

Here’s an example run while working on this post:

terrorism trump IS site:nytimes.com OR site:fox.com OR site:wsj.com

Without running the search yourself, what distribution of articles do you expect to see? (I also tested this using Tor to make sure my search history wasn’t creating an issue.)

By count of the results: nytimes.com 87, fox.com 0, wsj.com 18.

Surprised? I was. I wondered how the Washington Post stacks up against the New York Times. Same terms: nytimes.com 49, washingtonpost.com 52.

Do you think those differences are accidental? (I don’t.)

I’m not competent to create a list of Black websites for testing Google’s algorithm of oppression but the African American Literature Book Club has a list of the top 50 Black-Owned Websites. In addition, they offer a list of 300 Black-owned websites and host the search engine Huria Search, which only searches Black-owned websites.

To save you the extraction work, here are the top 50 Black-owned websites ready for testing against each other and other sites in the bowels of Google:

essence.com OR howard.edu OR blackenterprise.com OR thesource.com OR ebony.com OR blackplanet.com OR sohh.com OR blackamericaweb.com OR hellobeautiful.com OR allhiphop.com OR worldstarhiphop.com OR eurweb.com OR rollingout.com OR thegrio.com OR atlantablackstar.com OR bossip.com OR blackdoctor.org OR blackpast.org OR lipstickalley.com OR newsone.com OR madamenoire.com OR morehouse.edu OR diversityinc.com OR spelman.edu OR theybf.com OR hiphopwired.com OR aalbc.com OR stlamerican.com OR afro.com OR phillytrib.com OR finalcall.com OR mediatakeout.com OR lasentinel.net OR blacknews.com OR blavity.com OR cassiuslife.com OR jetmag.com OR blacklivesmatter.com OR amsterdamnews.com OR diverseeducation.com OR deltasigmatheta.org OR curlynikki.com OR atlantadailyworld.com OR apa1906.net OR theshaderoom.com OR notjustok.com OR travelnoire.com OR thecurvyfashionista.com OR dallasblack.com OR forharriet.com
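
If assembling that OR list by hand is tedious, a short script can do it for you. Here is a minimal Python sketch (the helper name and the three example domains are mine, purely for illustration):

from urllib.parse import quote_plus

def build_site_query(terms, sites):
    """Return 'terms site:a.com OR site:b.com ...' and its Google search URL."""
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    query = f"{terms} {site_clause}"
    return query, "https://www.google.com/search?q=" + quote_plus(query)

# Example: test your own terms against a subset of the list above.
query, url = build_site_query("terrorism trump", ["essence.com", "howard.edu", "blackenterprise.com"])
print(query)
print(url)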

Please spread the word to “young Black girls,” to use Noble’s phrase, to Black women in general, and to all marginalized communities: they need not wait for experts with programming staffs to detect marginalization at Google. Experts have agendas; discover your own and tell the rest of us about it.

November 21, 2018

Going Old School to Solve A Google Search Problem

Filed under: Bookmarking,Bookmarks,Javascript,Searching — Patrick Durusau @ 5:27 pm

Going Old School to Solve A Google Search Problem

I was completely gulled by the headline. I thought the “old school” solution was going to be:

Go ask a librarian.

My bad. Turns out the answer was:

Recently I got an email from my friend John Simpson. He was having a search problem and thought I might be able to help him out. I was, and wanted to share with you what I did, because a) you might be able to use it too and b) it’s not often in my Internet experience that you end up solving a problem using a method that was popular over ten years ago.

Here’s John’s problem: he does regular Google searches of a particular kind, but finds that with most of these searches he gets an overwhelming number of results from just a couple of sites. He wants to consistently exclude those sites from his search results, but he doesn’t want to have to type in the exclusions every time.

The rock-simple solution to this problem would be: do the Google search excluding all the sites you don’t want to see, bookmark the result, and then revisit that bookmark whenever you’re ready to search. But a more elegant solution would be to use a bookmark enhanced with JavaScript: a bookmarklet.

The rest of the post walks you through the creation of a simple bookmarklet. Easier than the name suggests.
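
If you would rather not build the bookmarklet, here is a rough Python sketch of the same idea: a canned search with the unwanted sites excluded via -site:. The two domains are placeholders, not sites from the original post.

import webbrowser
from urllib.parse import quote_plus

EXCLUDED = ["example-noise1.com", "example-noise2.com"]  # placeholder domains

def search_without(terms):
    # Append a -site: exclusion for every unwanted site, then open the search.
    query = terms + " " + " ".join(f"-site:{s}" for s in EXCLUDED)
    webbrowser.open("https://www.google.com/search?q=" + quote_plus(query))

# search_without("solr relevance tuning")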

When (not if) Google fails you, remember you can visit (or, in many cases, call) the reference desk at your local library.

Under the title: Why You Should Fall To Your Knees And Worship A Librarian, I encountered this item:

I’ve always had a weakness for the line:

People become librarians because they know too much.

Google can quickly point you down any number of blind alleys. Librarians quickly provide you with productive avenues to pursue. Your call.

November 10, 2018

PyCoder’s Weekly Archive 2012-2018 [Indexing Data Set?]

Filed under: Indexing,Python,Search Engines,Searching — Patrick Durusau @ 8:53 pm

PyCoder’s Weekly Archive 2012-2018

Python programmers already know about PyCoder’s Weekly, but if you don’t, it’s a weekly newsletter with headline Python news, discussions, Python jobs, articles & tutorials, projects & code, and events. Yeah, every week!

I mention it too as a potential indexing set for search software. I’m reasoning you are more likely to devote effort to indexing material of interest than to out-of-copyright newspapers. Besides, you will be better able to judge a good search result from a bad one when indexing PyCoder’s Weekly.

Enjoy!

January 24, 2018

‘Learning to Rank’ (No Unique Feature Name Fail – Update)

Filed under: Artificial Intelligence,ElasticSearch,Ranking,Searching — Patrick Durusau @ 8:02 pm

Elasticsearch ‘Learning to Rank’ Released, Bringing Open Source AI to Search Teams

From the post:

Search experts at OpenSource Connections, the Wikimedia Foundation, and Snagajob, deliver open source cognitive search capabilities to the Elasticsearch community. The open source Learning to Rank plugin allows organizations to control search relevance ranking with machine learning. The plugin is currently delivering search results at Wikipedia and Snagajob, providing significant search quality improvements over legacy solutions.

Learning to Rank lets organizations:

  • Directly optimize sales, conversions and user satisfaction in search
  • Personalize search for users
  • Drive deeper insights from a knowledge base
  • Customize ranking down for complex nuance
  • Avoid the sticker shock & lock-in of a proprietary "cognitive search" product

“Our mission is to empower search teams. This plugin gives teams deep control of ranking, allowing machine learning models to be directly deployed to the search engine for relevance ranking” said Doug Turnbull, author of Relevant Search and CTO, OpenSource Connections.

I need to work through all the documentation and examples but:

Feature Names are Unique

Because some model training libraries refer to features by name, Elasticsearch LTR enforces unique names for each features. In the example above, we could not add a new user_rating feature without creating an error.

is a warning of what you (and I) are likely to find.

Really? Someone involved in the design thought globally unique feature names was a good idea? Or at a minimum didn’t realize it is a very bad idea?

Scope anyone? Either in the programming or topic map sense?

Despite the unique feature name fail, I’m sure ‘Learning to Rank’ will be useful. But not as useful as it could have been.

Doug Turnbull (https://twitter.com/softwaredoug) advises that features are scoped by feature stores, so the correct prose would read: “…LTR enforces unique names for each feature within a feature store.”

No fail, just bad writing.
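
For the curious, here is a hedged sketch of what scoping by feature store looks like with the plugin’s REST API, driven from Python. The endpoint paths and body shape follow my reading of the plugin documentation at the time and may differ in your version; the store, feature set and field names are invented for the example.

import json
import requests

ES = "http://localhost:9200"

# Initialize the default feature store (feature names are unique within a store).
requests.put(f"{ES}/_ltr")

# Define a feature set containing a 'user_rating' feature.
featureset = {
    "featureset": {
        "features": [
            {
                "name": "user_rating",
                "params": [],
                "template_language": "mustache",
                "template": {
                    "function_score": {
                        "field_value_factor": {"field": "rating", "missing": 0}
                    }
                },
            }
        ]
    }
}
r = requests.post(f"{ES}/_ltr/_featureset/movie_features",   # some plugin versions use PUT
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(featureset))
print(r.status_code, r.text)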

January 12, 2018

Secrets to Searching for Video Footage (AI Assistance In Your Future?)

Filed under: Artificial Intelligence,Deep Learning,Journalism,News,Reporting,Searching — Patrick Durusau @ 11:24 am

Secrets to Searching for Video Footage by Aric Toler.

From the post:

Much of Bellingcat’s work requires intense research into particular events, which includes finding every possible photograph, video and witness account that will help inform our analysis. Perhaps most notably, we exhaustively researched the events surrounding the shoot down of Malaysian Airlines Flight 17 (MH17) over eastern Ukraine.

The photographs and videos taken near the crash in eastern Ukraine were not particularly difficult to find, as they were widely publicized. However, locating over a dozen photographs and videos of the Russian convoy transporting the Buk anti-aircraft missile launcher that shot down MH17 three weeks before the tragedy was much harder, and required both intense investigation on social networks and some creative thinking.

Most of these videos were shared on Russian-language social networks and YouTube, and did not involve another type of video that is much more important today than it was in 2014 — live streaming. Bellingcat has also made an effort to compile all user-generated videos of the events in Charlottesville on August 12, 2017, providing a database of livestreamed videos on platforms like Periscope, Ustream and Facebook Live, along with footage uploaded after the protest onto platforms like Twitter and YouTube.

Verifying videos is important, as detailed in this Bellingcat guide, but first you have to find them. This guide will provide advice and some tips on how to gather as much video as possible on a particular event, whether it is videos from witnesses of a natural disaster or a terrorist attack. For most examples in this guide, we will assume that the event is a large protest or demonstration, but the same advice is applicable to other events.

I was amused by this description of Snapchat and Instagram:


Snapchat and Instagram are two very common sources for videos, but also two of the most difficult platforms to trawl for clips. Neither has an intuitive search interface that easily allows researchers to sort through and collect videos.

I’m certain that’s true but a trained AI could sort out videos obtained by overly broad requests. As I’m fond of pointing out, not 100% accuracy but you can’t get that with humans either.

Augment your searching with a tireless AI. For best results, add or consult a librarian as well.

PS: I have other concerns at the moment, but a subset of the Bellingcat Charlottesville database would make a nice training basis for an AI, which could then be loosed on Instagram and other sources to discover more videos. The usual stumbling block for AI projects is human-curated material, which Bellingcat has already supplied.

December 8, 2017

Haystack: The Search Relevance Conference! (Proposals by Jan. 19, 2018) Updated

Filed under: Conferences,Relevance,Search Algorithms,Search Analytics,Searching — Patrick Durusau @ 5:16 pm

Haystack: The Search Relevance Conference!

From the webpage:

Haystack is the conference for improving search relevance. If you’re like us, you work to understand the shiny new tools or dense academic papers out there that promise the moon. Then you puzzle how to apply those insights to your search problem, in your search stack. But the path isn’t always easy, and the promised gains don’t always materialize.

Haystack is the no-holds-barred conference for organizations where search, matching, and relevance really matters to the bottom line. For search managers, developers & data scientists finding ways to innovate, see past the silver bullets, and share what actually has worked well for their unique problems. Please come share and learn!

… (inline form for submission proposals)

Welcome topics include

  • Information Retrieval
  • Learning to Rank
  • Query Understanding
  • Semantic Search
  • Applying NLP to search
  • Personalized Search
  • Search UX Strategy: Perceived relevance, smart snippeting
  • Measuring and testing search against business objectives
  • Nuts & bolts: plugins for Solr, Elasticsearch, Vespa, etc
  • Adjacent topics: recommendation systems, entity matching via search, and other topics

… (emphasis in original)

The first link for the conference I saw was http://mailchi.mp/e609fba68dc6/announcing-haystack-the-search-relevance-conference, which promised topics including:

  • Intent detection

The modest price of $75 covers our costs….

To see a solution to the problem of other minds and to discover their intent, all for $75, is quite a bargain. Especially since the $75 covers breakfast and lunch both days, plus dinner the first day in a beer hall. 😉

Even without solving philosophical problems, sponsorship by OpenSource Connections is enough to recommend this conference without reservation.

My expectation is this conference is going to rock for hard core search geeks!

PS: Ask if videos will be posted. Thanks!

December 6, 2017

When Good Enough—Isn’t [Search Engine vs. Librarian Challenge]

Filed under: Library,Search Behavior,Searching — Patrick Durusau @ 3:13 pm

When Good Enough—Isn’t by Patti Brennan.

From the post:

Why do we need librarians when we have Google?

What is the role of a librarian now that we can google anything?

How often have you heard that?

Let’s face it: We have all become enticed by the immediacy of the answers that search engines provide, and we’ve come to accept the good-enough answer—even when good enough isn’t.

When I ask a librarian for help, I am tapping not only into his or her expertise, but also into that of countless others behind the scenes.

From the staff who purposefully and thoughtfully develop the collection—guided by a collection development manual other librarians have carefully crafted and considered—to the team of catalogers and indexers who assign metadata to the items we acquire, to the technical staff who design the systems that make automated search possible, we’ve got a small army of librarians supporting my personal act of discovery…and yours.
… (emphasis in original)

A great read to pass along to search fans in your office!

The image of tapping into the wisdom of countless others (dare I say the “crowd?”) behind every librarian is an apt one.

With search engines, you are limited to your expertise and yours alone. No backdrop of indexers, catalogers, metadata experts, to say nothing of those contributing to all those areas.

Compared to a librarian, you are outclassed and overmatched, badly.

Are you ready to take Brennan’s challenge:

Let me offer a challenge: The next time you have a substantive question, ask a librarian and then report back here about how it went.

Ping me if you take Brennan up on that challenge. We all want to benefit from your experience.

PS: Topic maps can build a backdrop of staff wisdom for you or you can wing every decision anew. Which one do you think works better?

November 28, 2017

Onion Deep Web Link Directory

Filed under: Dark Web,Searching — Patrick Durusau @ 2:11 pm

Onion Deep Web Link Directory (http://4bcdydpud5jbn33s.onion/)

Without a .onion address in hand, you will need to consult a .onion link list.

This .onion link list offers:

  • Hidden Service Lists and search engines – 23 links
  • Marketplace financial and drugs – 25 links
  • Hosting – 6 links
  • Blogs – 18 links
  • Forums and Chans – 12 links
  • Email and Messaging – 8 links
  • Political – 11 links
  • Hacking – 4 links
  • Warez – 12 links
  • Erotic 18+ – 7 links
  • Non-English – 18 links

Not an overwhelming number of links but enough to keep you and a Tor browser busy over the coming holiday season.

FYI, adult/erotic content sites are a primary means for the distribution of malware.

Hostile entity rules of engagement apply at all .onion addresses. (Just as no one “knows” you are a dog on the Internet, an entity found at a .onion address could be the FBI. Act accordingly.)

I first saw this in the Hunchly Daily Hidden Services Report for 2017-11-03.

October 30, 2017

Smart HTML Form Trick

Filed under: HTML,Search Interface,Searching — Patrick Durusau @ 7:37 pm

An HTML form trick to add some convenience to life by Bob DuCharme.

From the post:

On the computers that I use the most, the browser home page is an HTML file with links to my favorite pages and a “single” form that lets me search the sites that I search the most. I can enter a search term in the field for any of the sites, press Enter, and then that site gets searched. The two tricks that I use to create these fields have been handy enough that I thought I’d share them in case they’re useful to others.

I quote the word “single” above because it appears to be a single form but is actually multiple little forms in the HTML. Here is an example with four of my entries; enter something into any of the fields and press Enter to see what I mean:

As always, an immediately useful tip from DuCharme!

The multiple search boxes reminded me of the early metasearch engines that combined results from multiple search engines.

It will vary by topic, but what resources would you search across day to day?

August 8, 2017

When You Say “Google,” You Mean #GCensor

Filed under: Censorship,Free Speech,Searching — Patrick Durusau @ 3:47 pm

Google Blocking Key Search Terms For Left Websites by Andre Damon.

From the post:

Note: In a previous article we reported that Popular Resistance had also seen more than a 60% drop in visits to our website since April when Google changed its search functions. This report goes further into how Google is blocking key search terms. See Google’s New Search Protocol Restricting Access To Leading Leftist Web Sites. KZ

Google blocked every one of the WSWS’s 45 top search terms

An intensive review of Internet data has established that Google has severed links between the World Socialist Web Site and the 45 most popular search terms that previously directed readers to the WSWS. The physical censorship implemented by Google is so extensive that of the top 150 search terms that, as late as April 2017, connected the WSWS with readers, 145 no longer do so.

These findings make clear that the decline in Google search traffic to the WSWS is not the result of some technical issue, but a deliberate policy of censorship. The fall took place in the three months since Google announced on April 25 plans to promote “authoritative web sites” above those containing “offensive” content and “conspiracy theories.”

Because of these measures, the WSWS’s search traffic from Google has fallen by two-thirds since April.

The WSWS has analyzed tens of thousands of search terms, and identified those key phrases and words that had been most likely to place the WSWS on the first or second page of search results. The top 45 search terms previously included “socialism,” “Russian revolution,” “Flint Michigan,” “proletariat,” and “UAW [United Auto Workers].” The top 150 results included the terms “UAW contract,” “rendition” and “Bolshevik revolution.” All of these terms are now blocked.
… (emphasis in original)

In addition to censoring “hate speech” and efforts such as Google Says It Will Do More to Suppress Terrorist Propaganda, there is now evidence that Google is tampering with search results for websites that are simply left-wing.

Promote awareness of censorship by Google, Facebook and Twitter by using #GCensor, #FCensor, and #TCensor, respectively.

I don’t expect to change the censorship behavior of #GCensor, #FCensor, and #TCensor. The remedy is non-censored alternatives.

All three have proven themselves untrustworthy guardians of free speech.

August 3, 2017

DMCA Complaint As Finding Aid

Filed under: Intellectual Property (IP),Library,Searching — Patrick Durusau @ 6:18 pm

Credit where credit is due: I saw this idea in How to Get Past DMCA Take-Downs in Google Search and report it here, sans the video.

The gist of the idea is that DMCA complaints, found at Lumen, specify (in the case of search engines) links that should not be displayed to users.

In a Google search result, content subject to a DMCA complaint will appear as:

In response to multiple complaints we received under the US Digital Millennium Copyright Act, we have removed 2 results from this page. If you wish, you may read the DMCA complaints that caused the removals at LumenDatabase.org: Complaint, Complaint.

If you follow the complaint links (knowing that Google tracks your following of those links), the complaints list the URLs to be removed from search results.

You can use the listed URLs to verify the presence of illegal content, compile lists of sites with such content, etc.
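
A hedged sketch of automating that step, assuming Lumen still serves notices as JSON at /notices/&lt;id&gt;.json (recent Lumen policy may require an API token and may redact URLs for anonymous requests; the field names follow my reading of their API and may have changed):

import requests

def infringing_urls(notice_id, token=None):
    params = {"authentication_token": token} if token else {}
    data = requests.get(f"https://lumendatabase.org/notices/{notice_id}.json",
                        params=params, timeout=30).json()
    urls = []
    for work in data.get("notice", {}).get("works", []):
        urls.extend(u.get("url") for u in work.get("infringing_urls", []))
    return urls

# print(infringing_urls(123456))  # hypothetical notice id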

Enjoy!

PS: I’m adding their RSS feed of new notices. You should too.

May 12, 2017

The Cartoon Bank

Filed under: Art,Image Recognition,Indexing,Searching — Patrick Durusau @ 4:30 pm

The Cartoon Bank by the Condé Nast Collection.

While searching for a cartoon depicting Sean Spicer at a White House news briefing, I encountered The Cartoon Bank.

A great source of instantly recognized cartoons but I’m still searching for one I remember from decades ago. 😉

May 4, 2017

Text Mining For Lawyers (The 55% Google Weaned Lawyers Are Missing)

Filed under: eDiscovery,Law,Searching,Text Mining — Patrick Durusau @ 1:52 pm

Working the Mines: How Text Mining Can Help Create Value for Lawyers by Rees Morrison, Juris Datoris, Legaltech News.

From the post:

To most lawyers, text mining may sound like a magic wand or more hype regarding “artificial intelligence.” In fact, with the right input, text mining is a well-grounded genre of software that can find patterns and insights from large amounts of written material. So, if your law firm or law department has a sizable amount of text from various sources, it can extract value from that collection through powerful software tools.

To help lawyers recognize the potential of text mining and demystify it, this article digs through typical steps of a project. Terms of art related to this domain of software are in bold and, yes, there will be a quiz at the end.

Our example project assumes that your law firm (or law department) has gathered a raft of written comments through an internal survey of lawyers or from clients who have typed their views in a client satisfaction survey (perhaps in response to an open-ended question like “In what ways could we improve?”). All that writing is grist for the mill of text mining!

Great overview of the benefits and complexities of text mining!

I was recently assured by a Google-weaned lawyer that natural language searching enabled him and his friends to do a few quick searches to find relevant authorities.

I could not help but point out my review of Blair and Maron’s work, which demonstrated that while attorneys estimated they had recovered 75% of relevant documents, in fact they had recovered barely 20%.

No solution returns 100% of the relevant documents for any non-trivial dataset, but leaving 55% on the floor doesn’t inspire confidence.

Especially when searchers consider any relevant result to be success. Whether it is depends on how many relevant authorities existed and whether any were closer to your facts than those found, among other things.

Is a relevant result your test for research success, or is it the best relevant result, with a measure of confidence in its quality?

March 18, 2017

RegexBuddy (Think Occur Mode for Emacs)

Filed under: Regexes,Searching — Patrick Durusau @ 4:44 pm

RegexBuddy

From the webpage:

RegexBuddy is your perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on actual data. Debug without guesswork by stepping through the actual matching process. Use the regex with source code snippets automatically adjusted to the particulars of your programming language. Collect and document libraries of regular expressions for future reuse. GREP (search-and-replace) through files and folders. Integrate RegexBuddy with your favorite searching and editing tools for instant access.

Learn all there is to know about regular expressions from RegexBuddy’s comprehensive documentation and regular expression tutorial.

I was reminded of RegexBuddy when I stumbled on the RegexBuddy Manual in a search result.

The XQuery/XPath regex treatment is far briefer than I would like but at 500+ pages, it’s an impressive bit of work. Even without a copy of RegexBuddy, working through the examples will make you a regex terrorist.

The only unfortunate aspect, for *nix users, is that you need to run RegexBuddy in a Windows VM. 🙁

If you are comfortable with Emacs, Windows or otherwise, then the Occur mode comes to mind. It doesn’t have the visuals of RegexBuddy but then you are accustomed to a power-user environment.

In terms of productivity, it’s hard to beat regexes. I passed along a one-liner awk regex tip today to extract content from a “…pile of nonstandard multiply redundant JavaScript infested pseudo html.”

I’ve seen the HTML in question. The description seems a bit generous to me. 😉
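
For illustration only, here is the Python flavor of that kind of quick-and-dirty pass: a single regex substitution that strips scripts and tags to leave the visible text. Regexes are brittle on HTML in general, so treat this as a throwaway extraction, not a parser.

import re

messy = '<div onclick="x()">Useful <b>content</b></div><script>junk()</script>'
# Drop <script>...</script> blocks and any remaining tags, then collapse whitespace.
text = re.sub(r"<script.*?</script>|<[^>]+>", " ", messy, flags=re.S)
print(" ".join(text.split()))   # -> Useful content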

Try your hand at regexes and see if your productivity increases!

February 16, 2017

Can You Replicate Your Searches?

Filed under: Bioinformatics,Biomedical,Medical Informatics,Search Engines,Searching — Patrick Durusau @ 4:30 pm

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Mellisa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed boost.sh.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed boost.sh,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /rt.sh” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse events.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.
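
A small sketch of the kind of lint pass a reviewer could run over a published Ovid strategy, flagging field lists joined with periods (“.sh.kw.ti.ab”) that Ovid expects as a period followed by commas (“.sh,kw,ti,ab”). The field list here covers only the tags mentioned in the comment:

import re

FIELD = r"(?:sh|kw|ti|ab)"
BAD_FIELD_LIST = re.compile(rf"\.{FIELD}(?:\.{FIELD})+\b")

def lint(line_no, line):
    for m in BAD_FIELD_LIST.finditer(line):
        fixed = "." + m.group(0).lstrip(".").replace(".", ",")
        print(f"line {line_no}: '{m.group(0)}' -> '{fixed}'")

lint(39, "tumour bed boost.sh.kw.ti.ab")   # -> '.sh.kw.ti.ab' -> '.sh,kw,ti,ab'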

February 14, 2017

We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Filed under: DSL,ElasticSearch,Merging,Search Engines,Searching,Topic Maps — Patrick Durusau @ 8:26 pm

We’re Bringing Learning to Rank to Elasticsearch.

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

  1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
  2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  3. Train a model that can accurately map features to a relevance score
  4. Deploy the model to your search infrastructure, using it to rank search results in production

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)
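
As a toy illustration of steps 1–3 above, here is a sketch using scikit-learn’s linear regression as a stand-in for a real learning-to-rank model (the feature values and grades are made up; a production system would use something like LambdaMART over a much larger judgment list):

from sklearn.linear_model import LinearRegression

# Judgment list rows: features = [title_match_score, days_since_publish], grade 0-2.
features = [[2.3, 10], [0.4, 3], [1.9, 400], [0.1, 500]]
grades = [2, 1, 1, 0]

# Step 3: train a model that maps features to a relevance score.
model = LinearRegression().fit(features, grades)
print(model.predict([[2.0, 30]]))   # predicted relevance for a new (query, doc) pair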

A great post as always but of particular interest for topic map fans is this passage:


Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

December 10, 2016

Google Helps Spread Fake News [Fake News & Ad Revenue – Testing]

Filed under: Journalism,News,Reporting,Searching — Patrick Durusau @ 1:31 pm

Google changed its search algorithm and that made it more vulnerable to the spread of fake news by Hannah Roberts.

From the post:

Google’s search algorithm has been changed over the last year to increasingly reward search results based on how likely you are to click on them, multiple sources tell Business Insider.

As a result, fake news now often outranks accurate reports on higher quality websites.

The problem is so acute that Google’s autocomplete suggestions now actually predict that you are searching for fake news even when you might not be, as Business Insider noted on December 5.

Hannah does a great job of setting forth the evidence and opinions on the algorithm change but best summarizes it when she says:


The changes to the algorithm now move links up Google’s search results page if Google detects that more people are clicking on them, search experts tell Business Insider.

Just in case you don’t know:

more clicks != credible/useful search results

But it is true:

more clicks = more usage/ad revenue

Google and Facebook find “fake news” profitable. Both will make a great show of suppressing outlying “fake news,” but not so much as to impact profits.

There’s a data science “fake news” project:

Track the suppression of “fake news” by Google and Facebook against the performance of their ad revenue.

Hypothesis: When suppression of “fake news” impinges on ad revenue for more than two consecutive hours, dial back on suppression mechanisms. (Ditto for 4, 6, 12 and 24 hour cycles.)

Odds on Google and Facebook being transparent with regard to suppression of “fake news” and ad revenue, so that the results of testing that hypothesis are verifiable?

😉

November 22, 2016

Egyptological Museum Search

Filed under: History,Museums,Searching — Patrick Durusau @ 5:06 pm

Egyptological Museum Search

From the post:

The Egyptological museum search is a PHP tool aimed to facilitate locating the descriptions and images of ancient Egyptian objects in online catalogues of major museums. Online catalogues (ranging from selections of highlights to complete digital inventories) are now offered by almost all major museums holding ancient Egyptian items and have become indispensable in research work. Yet the variety of web interfaces and of search rules may overstrain any person performing many searches in different online catalogues.

Egyptological museum search was made to provide a single search point for finding objects by their inventory numbers in major collections of Egyptian antiquities that have online catalogues. It tries to convert user input into search queries recognised by museums’ websites. (Thus, for example, stela Geneva D 50 is searched as “D 0050,” statue Vienna ÄS 5046 is searched as “AE_INV_5046,” and coffin Turin Suppl. 5217 is searched as “S. 05217.”) The following online catalogues are supported:

The search interface uses a short list of aliases for museums.

Once you see and use the interface proper, here, I hope you will be interested in volunteering to improve it.
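
To make the conversion step concrete, here is a minimal Python sketch that reproduces only the three rewriting examples given above; a real tool needs a rule per museum catalogue:

import re

def geneva(number):        # "D 50" -> "D 0050"
    letter, digits = re.match(r"([A-Z]+)\s*(\d+)", number).groups()
    return f"{letter} {int(digits):04d}"

def vienna(number):        # "ÄS 5046" -> "AE_INV_5046"
    return "AE_INV_" + re.search(r"\d+", number).group()

def turin_suppl(number):   # "Suppl. 5217" -> "S. 05217"
    digits = int(re.search(r"\d+", number).group())
    return f"S. {digits:05d}"

print(geneva("D 50"), vienna("ÄS 5046"), turin_suppl("Suppl. 5217"))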

October 20, 2016

Guide to Making Search Relevance Investments, free ebook

Filed under: Searching — Patrick Durusau @ 10:36 pm

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is an interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.

Enjoy!

October 19, 2016

The Podesta Emails [In Bulk]

Filed under: Government,Hillary Clinton,Searching,Topic Maps,Wikileaks — Patrick Durusau @ 7:53 pm

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:

[Image: Wikileaks Podesta Emails basic search interface]

and, advanced:

[Image: Wikileaks Podesta Emails advanced search interface]

As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!
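
Here is a hedged sketch of the kind of processing the bulk download enables: walk a directory of extracted .eml files and count sender/recipient pairs. The directory name is a placeholder, and the actual dump’s layout may differ (it may be mbox rather than individual .eml files), so adjust accordingly.

import collections
import email
import pathlib
from email import policy

pairs = collections.Counter()
for path in pathlib.Path("podesta-emails").rglob("*.eml"):   # placeholder directory
    msg = email.message_from_bytes(path.read_bytes(), policy=policy.default)
    sender = (msg.get("From") or "").strip()
    for rcpt in (msg.get("To") or "").split(","):
        if sender and rcpt.strip():
            pairs[(sender, rcpt.strip())] += 1

for (frm, to), n in pairs.most_common(20):
    print(n, frm, "->", to)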

Enjoy!

September 22, 2016

Apache Lucene 6.2.1 and Apache Solr 6.2.1 Available [Presidential Data Leaks]

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 10:55 am

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.2.1

Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.2.1

If you aren’t using Lucene/Solr 6.2, here’s your chance to grab the latest bug fixes as well!

Data leaks will accelerate as the US presidential election draws to a close.

What’s your favorite tool for analysis and delivery of data dumps?

Enjoy!

September 20, 2016

NSA: Being Found Beats Searching, Every Time

Filed under: Searching,Topic Maps — Patrick Durusau @ 4:41 pm

Equation Group Firewall Operations Catalogue by Mustafa Al-Bassam.

From the post:

This week someone auctioning hacking tools obtained from the NSA-based hacking group “Equation Group” released a dump of around 250 megabytes of “free” files for proof alongside the auction.

The dump contains a set of exploits, implants and tools for hacking firewalls (“Firewall Operations”). This post aims to be a comprehensive list of all the tools contained or referenced in the dump.

Mustafa’s post is a great illustration of why “being found beats searching, every time.”

Think of the cycles you would have to spend to duplicate this list. Multiply that by the number of people interested in this list. Assuming their time is not valueless, do you start to see the value-add of Mustafa’s post?

Mustafa found each of these items in the data dump and then preserved his finding for the use of others.

It’s not a very big step beyond this preservation to the creation of a container for each of these items, enabling the preservation of other material found on them or related to them.

Search is a starting place and not a destination.

Unless you enjoy repeating the same finding process over and over again.

Your call.

September 19, 2016

Congress.gov Corrects Clinton-Impeachment Search Results

Filed under: Government,Government Data,Searching — Patrick Durusau @ 8:14 am

After posting Congress.gov Search Alert: “…previous total of 261 to the new total of 0.” [Solved] yesterday, pointing out that a change from http:// to https:// altered a search result for Clinton w/in 5 words impeachment, I got an email this morning:

[Image: Congress.gov email correcting the saved-search results]

I appreciate the update and correction for saved searches, but my point about remote data changing without notice to you remains valid.

I’m still waiting for word on bulk downloads from both Wikileaks and DC Leaks.

Why leak information vital to public discussion and then limit access to search?

September 18, 2016

Introducing arxiv-sanity

Filed under: Archives,Searching,Similarity — Patrick Durusau @ 7:43 pm

Only a small part of Arxiv appears at: http://www.arxiv-sanity.com/ but it is enough to show the feasibility of this approach.

What captures my interest is the potential to substitute/extend the program to use other similarity measures.

Bearing in mind that searching is only the first step towards the acquisition and preservation of knowledge.
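
As a sketch of that substitution idea, here is TF-IDF plus cosine similarity over a few toy abstracts, which is close in spirit to what arxiv-sanity does (its actual implementation may differ); any other similarity measure could be swapped in at the last step:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "We study learning to rank for search relevance.",
    "A neural model for ranking documents by relevance.",
    "Topic maps for merging heterogeneous vocabularies.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
print(cosine_similarity(X).round(2))   # row i ~ "papers similar to paper i"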

PS: I first saw this in a tweet by Data Science Renee.

Congress.gov Search Alert: “…previous total of 261 to the new total of 0.” [Solved]

Filed under: Government,Government Data,Searching — Patrick Durusau @ 11:03 am

Odd message from the Congress.gov search alert this AM:

[Image: Congress.gov search alert showing the drop from 261 results to 0]

Here’s the search I created back in June, 2016:

[Image: Saved Congress.gov search, Clinton within five words of impeachment, all congresses]

My probably inaccurate recollection is that I was searching for some quote from the impeachment of Bill Clinton and was too lazy to specify a term of Congress, hence:

all congresses – searching for Clinton within five words, impeachment

Fairly trivial search that produced 261 “hits.”

I set the search alert more to explore the search options than any expectation of different future results.

Imagine my surprise to find that all congresses – searching for Clinton within five words, impeachment performed today, results in 0 “hits.”

Suspecting some internal changes to the search interface, I re-entered the search today and got 0 “hits.”

Other saved searches with radically different search results as of today?

This is not, repeat not, the result of some elaborate conspiracy to assist Secretary Clinton in her bid for the presidency.

I do think something fundamental has gone wrong with searching at Congress.gov and it needs to be fixed.

This is an illustration of why Wikileaks, DC Leaks and other data sites should provide easy to access downloads in bulk of their materials.

Providing search interfaces to document collections is a public service, but document collections or access to them can change in ways not transparent to search users. Such as demonstrated by the CIA removing documents previously delivered to the Senate.

Petition Wikileaks, DC Leaks and other data sites for easy bulk downloads.

That will ensure the “evidence” does not shift under your feet and will enable more sophisticated means of analysis than brute-force search.


Update: The change from http:// to https:// by the congress.gov site trashed my saved query and my attempt to re-perform the same search using http://.

Using https:// returns the same 261 search results.

What is your experience with other saved searches at congress.gov?

July 2, 2016

Five Essential Research Tips for Journalists Using Google

Filed under: Journalism,Searching — Patrick Durusau @ 3:16 pm

Five Essential Research Tips for Journalists Using Google by Temi Adeoye.

This graphic:

[Image: ICFJ graphic of Google search tips for journalists]

does not appear in Temi’s post but rather in a tweet by the International Center for Journalists (ICFJ) about his post.

See Temi’s post for the details but this graphic is a great reminder.

This will make a nice addition to my local page of search links.

April 26, 2016

Visual Searching with Google – One Example – Neo4j – Raspberry Pi

Filed under: Graphs,Neo4j,Searching — Patrick Durusau @ 7:16 pm

Just to show I don’t spend too much time thinking of ways to gnaw on the ankles of Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, here is my experience with visual searching with Google today.

I saw this image on Twitter:

[Image: Neo4j cluster photo seen on Twitter]

I assumed that cutting the “clutter” from around the cluster might produce a better result. Besides, the plastic separators looked (to me) to be standard and not custom made.

Here is my cropped image for searching:

[Image: Cropped Neo4j cluster photo]

Google responded this looks like: “water.” 😉

OK, so I tried cropping it more just to show the ports, thinking that might turn up similar port arrangements. Here’s that image:

[Image: Further crop showing only the ports]

Google says: “machinery.” With a number of amusing “similar” images.

BTW, when I tried the full image, the first one, Google says: “electronics.”

OK, so much for Google image searching. What if I try?

Searching on neo4j cluster and raspberry pi (the most likely suspect), my first “hit” had this image:

[Image: First search hit, a cluster of the same height]

Same height as the search image.

My seventh “hit” has this image:

[Image: Seventh hit, a cluster with the same height and logo, with Stefan Armbruster beside it]

Same height and logo as the search image. That’s Stefan Armbruster next to the cluster. (He does presentations on building the cluster, but I have yet to find a video of one of those presentations.)

My eighth “hit” has this image:

[Image: Eighth hit, a cluster with matching network cabling and height]

Common wiring color (networking cable), height.

Definitely Raspberry Pi but I wasn’t able to uncover further details.

Very interested in seeing a video of Stefan putting one of these together!

April 16, 2016

UC Davis Spent $175,000.00 To Suppress This Image (let’s disappoint them)

Filed under: Politics,Searching — Patrick Durusau @ 8:19 pm

[Image: UC Davis police officer pepper-spraying seated protesters]

If you have a few minutes, could you repost this image to your blog and/or Facebook page?

Some references you may want to cite include:

Pepper-sprayed students outraged as UC Davis tried to scrub incident from web by Anita Chabria

Calls for UC Davis chancellor’s ouster grow amid Internet scrubbing controversy by Sarah Parvini and Ruben Vives.

UC Davis Chancellor Faces Calls To Resign Over Pepper Spray Incident (NPR)

Katehi’s effort to alter search engine results backfires spectacularly

UC Davis’ damage control: Dumb-de-dumb-dumb

Reposting the image and links to the posts cited above will help disappoint the mislaid plans to suppress it.

What is more amazing than the chancellor thinking information on the Internet can be suppressed, at least for a paltry $175K, is that this pattern will be repeated year after year.

Lying about information known to others is a losing strategy, always.

But that strategy will be picked up by other universities, governments and their agencies, corporations, to say nothing of individuals.

Had UC Davis spent that $175K on better training for its police officers, people would still talk about this event, but it would be in contrast to the new and improved way UC Davis deals with protesters.

That’s not likely to happen now.

April 14, 2016

Visualizing Data Loss From Search

Filed under: Entity Resolution,Marketing,Record Linkage,Searching,Topic Maps,Visualization — Patrick Durusau @ 3:46 pm

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:

[Image: Venn diagram of search results for “duplicate detection” vs. “coreference resolution”]

If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?
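
For anyone who wants to redraw it, here is a rough reproduction assuming the matplotlib-venn package and an assumed overlap count (the true overlap isn’t given here, so 500 is illustration only):

from matplotlib import pyplot as plt
from matplotlib_venn import venn2

overlap = 500   # assumption for illustration only
venn2(subsets=(3854 - overlap, 3290 - overlap, overlap),
      set_labels=("duplicate detection", "coreference resolution"))
plt.savefig("duplicate-v-coreference.png", dpi=100)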

It is a visualization that could be performed on a client’s data, using their search engine/database.

In order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlap in terminology between two vendors for the United States Army and the Air Force? That is, no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capacity, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?


Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testament to the software that I was able to produce a useful image in less than a day. Kudos!

April 8, 2016

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Filed under: Indexing,Lucene,Searching,Solr — Patrick Durusau @ 4:01 pm

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.0.0
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.0.0

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored=”false” to preserve sort order on multi-valued fields that have both stored=”true” and docValues=”true”.
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-‘. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)
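
As a taste of the Parallel SQL interface mentioned above, here is a hedged sketch of posting a SQL statement to a collection’s /sql handler; the collection and field names are placeholders, and the exact request format should be checked against the reference manual for your release:

import requests

resp = requests.post(
    "http://localhost:8983/solr/mycollection/sql",
    data={"stmt": "SELECT DISTINCT category FROM mycollection LIMIT 10"},
    timeout=30,
)
print(resp.json())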

Happy searching!
