Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 10, 2018

PyCoder’s Weekly Archive 2012-2018 [Indexing Data Set?]

Filed under: Indexing,Python,Search Engines,Searching — Patrick Durusau @ 8:53 pm

PyCoder’s Weekly Archive 2012-2018

Python programmers already know about PyCoder’s Weekly, but if you don’t, it’s a weekly newsletter with headline Python news, discussions, Python jobs, articles & tutorials, projects & code, and events. Yeah, every week!

I mention it here as a potential indexing data set for search software. My reasoning: you are more likely to devote effort to indexing material of interest to you than to out-of-copyright newspapers. Besides, you will be better able to judge a good search result from a bad one when indexing PyCoder’s Weekly.

Enjoy!

June 12, 2017

FreeDiscovery

Filed under: Python,Scikit-Learn,Search Engines — Patrick Durusau @ 4:30 pm

FreeDiscovery: Open Source e-Discovery and Information Retrieval Engine

From the webpage:

FreeDiscovery is built on top of existing machine learning libraries (scikit-learn) and provides a REST API for information retrieval applications. It aims to benefit existing e-Discovery and information retrieval platforms with a focus on text categorization, semantic search, document clustering, duplicates detection and e-mail threading.

In addition, FreeDiscovery can be used as a Python package and exposes several estimators with a scikit-learn compatible API.

Python 3.5+ required.

The homepage has command-line examples, with a pointer to http://freediscovery.io/doc/stable/examples/ for more examples.

The additional examples use a subset of the TREC 2009 legal collection. Cool!
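If you want a feel for the REST interaction pattern before reading the docs, here is a rough sketch. The endpoint paths and payload fields below are hypothetical stand-ins, not FreeDiscovery’s documented API, so check the examples page above for the real calls:

import requests

BASE = "http://localhost:5001/api/v0"  # assumed local server and API version

# Hypothetical: ingest a directory of documents into a new dataset.
r = requests.post(BASE + "/datasets", json={"data_dir": "/data/trec-legal"})
dataset_id = r.json()["id"]

# Hypothetical: run a semantic search over the ingested dataset.
r = requests.get(BASE + "/search",
                 params={"dataset_id": dataset_id, "query": "email threading"})
for hit in r.json()["results"]:
    print(hit["document_id"], hit["score"])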

I saw this in a tweet by Lynn Cherny today.

Enjoy!

February 16, 2017

Can You Replicate Your Searches?

Filed under: Bioinformatics,Biomedical,Medical Informatics,Search Engines,Searching — Patrick Durusau @ 4:30 pm

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Mellisa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed boost.sh.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed boost.sh,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /rt.sh” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse events.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.

February 14, 2017

We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Filed under: DSL,ElasticSearch,Merging,Search Engines,Searching,Topic Maps — Patrick Durusau @ 8:26 pm

We’re Bringing Learning to Rank to Elasticsearch.

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

  1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
  2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  3. Train a model that can accurately map features to a relevance score
  4. Deploy the model to your search infrastructure, using it to rank search results in production

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)
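To make those four steps concrete, here is a toy sketch of steps 1 through 3 in Python with scikit-learn. This is my illustration, not the plugin’s API, and the grades and feature values are invented; a real judgment list comes from analytics:

from sklearn.ensemble import GradientBoostingRegressor

# Step 1: a judgment list. Grades per (query, document) pair:
# 2 = exactly relevant, 1 = moderately relevant, 0 = not relevant.
# Step 2: hypothesized features per pair:
# (title TF*IDF, body TF*IDF, recency in days).
judgments = [
    ((2.3, 1.1, 4),   2),
    ((0.9, 1.4, 120), 1),
    ((0.1, 0.2, 400), 0),
    ((1.8, 0.7, 30),  2),
]

# Step 3: train a model that maps features to a relevance score.
X = [features for features, grade in judgments]
y = [grade for _, grade in judgments]
model = GradientBoostingRegressor().fit(X, y)

# Step 4 (deployment) would hand this model to the search infrastructure;
# here we just score one candidate document's features.
print(model.predict([(2.0, 1.0, 10)]))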

A great post as always but of particular interest for topic map fans is this passage:


Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories, could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

August 20, 2016

Frinkiac (Simpsons)

Filed under: Humor,Search Engines — Patrick Durusau @ 2:36 pm

Frinkiac

From the webpage:

Frinkiac has nearly 3 million Simpsons screencaps so get to searching for crying out glayvin!

With a link to Morbotron as well.

Once you recover, consider reading: Introducing Frinkiac, The Simpsons Search Engine Built by Rackers by Abe Selig.

When you aren’t trying to boil the ocean with search, the results can be pretty damned amazing.

July 5, 2016

SEO Tools: The Complete List (153 Free and Paid Tools) [No IEO Tools?]

Filed under: Search Engines,WWW — Patrick Durusau @ 9:15 am

SEO Tools: The Complete List (153 Free and Paid Tools) by Brian Dean.

Updated as of May 20, 2016.

There is a PDF version but that requires sacrifice of your email address, indeterminate waiting for the confirmation email, etc.

The advantage of the PDF version isn’t clear, other than you can print it on marketing’s color printer. Something to cement that close bond between marketing and IT.

With the abundance of search engine optimization tools, have you noticed the lack of index engine optimization (IEO) tools?

When an indexing engine is “optimized,” settings of the indexing engine are altered to produce a “better” result. So far as I know, the data being indexed isn’t normally changed to alter the behavior of the indexing engine.

In contrast, data destined for a search engine is expected to change/optimize itself to alter the behavior of the search engine.

What if data were index engine optimized, say to distinguish terms with multiple meanings, at the time of indexing? Say articles in the New York Times were paired with vocabulary lists of the names, terms, etc. that appear within them.

Add bi-directional links, so that an index of the vocabulary lists would at the same time be an index of the articles themselves.
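A toy sketch of such a bi-directional index, with invented article IDs and vocabulary lists (note how the curated terms disambiguate the two senses of “Mercury”):

from collections import defaultdict

# Hypothetical curated vocabulary lists paired with articles.
vocab_lists = {
    "nyt-2016-05-20-a1": ["Mercury (planet)", "transit", "NASA"],
    "nyt-2016-05-20-b3": ["Mercury (element)", "EPA", "emissions"],
}

# Invert the pairing: term -> articles.
term_to_articles = defaultdict(set)
for article, terms in vocab_lists.items():
    for term in terms:
        term_to_articles[term].add(article)

# The index of vocabulary lists is at the same time an index of the articles:
print(sorted(term_to_articles["Mercury (planet)"]))  # term -> articles
print(vocab_lists["nyt-2016-05-20-a1"])              # article -> terms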

Thoughts?

February 8, 2016

Does HonestSociety.com Not Advertise With Google? (Rigging Search Results)

Filed under: Advertising,Search Engines,Searching — Patrick Durusau @ 11:27 am

I ask about Honestsociety.com because when I search on Google with the string:

honest society member

I get 82,100,000 “hits” and the first page is entirely honor society stuff.

No “did you mean,” or “displaying results for…,” etc.

Not a one.

Top of the second page of results did have a webpage that mentions honestsociety.com, but not their home site.

I can’t recall seeing an Honestsociety ad with Google and thought perhaps one of you might.

Lacking such ads, my seat of the pants explanation for “honest society member” returning the non-responsive “honor society” listing isn’t very generous.

What anomalies have you observed in Google (or other) search results?

What searches would you use to test ranking in search results by advertiser with Google versus non-advertiser with Google?

Rigging Searches

For my part, it isn’t a question of whether search results are rigged or not, but rather are they rigged the way I or my client prefers?

Or to say it in a positive way: All searches are rigged. If you think otherwise, you haven’t thought very deeply about the problem.

Take library searches for example. Do you think they are “fair” in some sense of the word?

Hmmm, would you agree that the collection practices of a library will give a user an impression of the literature on a subject?

So the search itself isn’t “rigged,” but the data underlying the results certainly influences the outcome.

If you let me pick the data, I can guarantee whatever search result you want to present. Ditto for the search algorithms.

The best we can do is make our choices with regard to the data and algorithms explicit, so that others accept our “rigged” data or choose to “rig” it differently.

February 3, 2016

Reverse Image Search (TinEye) [Clue to a User Topic Map Interface?]

TinEye was mentioned in a post I wrote in 2015, Baltimore Burning and Verification, but I did not follow up at the time.

Unlike some US intelligence agencies, TinEye has a cool logo:

[TinEye logo]

Free registration enables you to share search results with others, an important feature for news teams.

I only tested the plugin for Chrome, but it offers useful result options:

[screenshot: TinEye extension result options]

Once installed, use by hovering over an image in your browser, right “click” and select “Search image on TinEye.” Your results will be presented as set under options.

Clue to User Topic Map Interface

That is a good example of how one version of a topic map interface should work. Select some text, right “click” and “Search topic map ….(preset or selection)” with configurable result display.

That puts you into interaction with the topic map, which can offer properties to enable you to refine the identification of a subject of interest and then a merged presentation of the results.

As with a topic map, all sorts of complicated things are happening in the background with the TinEye extension.

But as a user, I’m interested in the results that TinEye presents, not how it got them.

I used to say “more interested” to indicate I might care how useful results came to be assembled. That’s a pretension that isn’t true.

It might be true in some particular case, but for the vast majority of searches, I just want the (uncensored Google) results.

US Intelligence Community Logo for Same Capability

I discovered the most likely intelligence community logo for a similar search program:

[image: peeping Tom]

The answer to the age-old question of “who watches the watchers?” is us. Which watchers are you watching?

February 2, 2016

How to Build a TimesMachine [New York Times from 1851-2002]

Filed under: History,News,Search Engines — Patrick Durusau @ 1:59 pm

How to Build a TimesMachine by Jane Cotler and Evan Sandhaus.

From the post:

At the beginning of this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002. Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980, which consisted of over 11 million articles spread out over approximately 2.5 million pages. The new expansion adds an additional 8,035 complete issues containing 1.4 million articles over 1.6 million pages.

[image: The Time Machine]

Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this post we’ll describe how we tackled two. First, we’ll discuss the fundamental challenge with TimesMachine: efficiently providing a user with a scan of an entire day’s newspaper without requiring the download of hundreds of megabytes of data. Then, we’ll discuss a fascinating string matching problem we had to solve in order to include articles published after 1980 in TimesMachine.

It’s not all the extant Hebrew Bible witnesses, both images and transcription, or all extant cuneiform tablets with existing secondary literature, but if you are interested in more recent events, what a magnificent resource!

Tesseract-ocr gets a shout-out and link for its use on the New York Times archives.

The string matching solution for search shows the advantages of finding a “nearly perfect” solution.
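The post is worth reading for the details, but to illustrate the flavor of the problem (matching noisy OCR text against clean article text), here is a sketch using difflib from the Python standard library. This is my illustration, not the TimesMachine team’s algorithm:

import difflib

ocr_text = "Tbe Senate voted 98-1 on Tuesdav to approve the measure."
candidates = [
    "The Senate voted 98-1 on Tuesday to approve the measure.",
    "The House adjourned without a vote on the measure.",
]

# Score each candidate by similarity ratio and keep the best.
scores = [(difflib.SequenceMatcher(None, ocr_text, c).ratio(), c)
          for c in candidates]
best_score, best_match = max(scores)
print(round(best_score, 3), best_match)  # the "nearly perfect" match wins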

January 24, 2016

A Comprehensive Guide to Google Search Operators

Filed under: News,Reporting,Search Engines,Search Interface,Searching — Patrick Durusau @ 5:25 pm

A Comprehensive Guide to Google Search Operators by Marcela De Vivo.

From the post:

Google is, beyond question, the most utilized and highest performing search engine on the web. However, most of the users who utilize Google do not maximize their potential for getting the most accurate results from their searches.

By using Google Search Operators, you can find exactly what you are looking for quickly and effectively just by changing what you input into the search bar.

If you are searching for something simple on Google like [Funny cats] or [Francis Ford Coppola Movies] there is no need to use search operators. Google will return the results you are looking for effectively no matter how you input the words.

Note: Throughout this article whatever is in between these brackets [ ] is what is being typed into Google.

When [Francis Ford Coppola Movies] is typed into Google, Google reads the query as Francis AND Ford AND Coppola AND Movies. So Google will return pages that have all those words in them, with the most relevant pages appearing first. Which is fine when you’re searching for very broad things, but what if you’re trying to find something specific?

What happens when you’re trying to find a report on the revenue and statistics from the United States National Park System in 1995 from a reliable source, and not using Wikipedia?

I can’t say that Marcela’s guide is comprehensive for Google in 2016, because I am guessing the post was written in 2013. Hard to say if early or late 2013 without more research than I am willing to donate. Dating posts makes it easy for readers to spot current or past-use-date information.

For the information that is present, this is a great presentation and list of operators.

One way to use this post is to work through every example but use terms from your domain.
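For instance, Marcela’s park-revenue example might be narrowed with a few operators I know are still current (same bracket convention as the post):

[“national park system” revenue 1995 site:nps.gov]

[“national park system” revenue statistics 1995 filetype:pdf -site:wikipedia.org]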

If you are mining the web for news reporting, compete against yourself on successive stories or within a small group.

Great resource for creating a search worksheet for classes.

December 12, 2015

Previously Unknown Hip Replacement Side Effect: Web Crawler Writing In Python

Filed under: Python,Search Engines,Searching,Webcrawler — Patrick Durusau @ 8:02 pm

Crawling the web with Python 3.x by Doug Mahugh.

From the post:

These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages to accomplish tasks ranging from searches to archiving content to statistical analyses and so on. Web crawling is a task that has been automated by developers in every programming language around, many times — for example, a search for web crawling source code yields well over a million hits.

So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)

But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling code is Scrapy, which is only available for Python 2. I found a few Python 3 samples, but they all seemed to be either too trivial (not avoiding re-scanning the same page, for example) or too needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.

In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well.
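For a sense of how small such a crawler can be, here is a minimal Python 3, standard-library-only sketch (mine, not Doug’s code) that follows links and keeps a visited set so no page is re-scanned:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href of every anchor tag encountered.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(start_url, max_pages=50):
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue  # the "don't re-scan the same page" check
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to fetch or decode
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com"))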

Doug has been writing publicly about his hip replacement surgery so I don’t think this has any privacy issues. 😉

I am interested to see what he writes once he is fully recovered.

My contacts at the American Medical Association disavow any knowledge of hip replacement surgery driving patients to write in Python and/or to write web crawlers.

I suppose there could be liability implications, especially for C/C++ programmers who lose their programming skills except for Python following such surgery.

Still, glad to hear Doug has been making great progress and hope that it continues!

December 10, 2015

Nutch 1.11 Release

Filed under: Nutch,Search Engines — Patrick Durusau @ 3:56 pm

Nutch 1.11 Release

From the homepage:

The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release.

This release is the result of many months of work and around 100 issues addressed. For a complete overview of these issues please see the release report.

As usual in the 1.X series, release artifacts are made available as both source and binary and also available within Maven Central as a Maven dependency. The release is available from our DOWNLOADS PAGE.

I have played with Nutch but never really taken advantage of it as a day-to-day discovery tool.

I don’t need to boil the Internet ocean: a well-chosen slice covers 90 to 95% of the content that interests me on a day-to-day basis.

Moreover, searching a limited part of the Internet would enable things like granular date sorting, rather than coarse buckets like past week, past month, past year.

Not to mention I could abandon the never sufficiently damned page-rank sorting of search results. Maybe you need to look “busy” as you sort through search result cruft, time and again, but I have other tasks to fill my time.

Come to think of it, as I winnow through search results, I could annotate, tag, mark (choose your terminology) such that a subsequent search turns up my evaluation, ranking, preference among those items. Something like the sketch below.
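A rough sketch of what I mean, with an invented result format (Nutch’s actual output differs):

import json, pathlib

NOTES = pathlib.Path("my_judgments.json")

def save_judgment(url, tag, note):
    # Persist my evaluation of a result, keyed by URL.
    data = json.loads(NOTES.read_text()) if NOTES.exists() else {}
    data[url] = {"tag": tag, "note": note}
    NOTES.write_text(json.dumps(data, indent=2))

def annotate_results(results):
    # Fold prior judgments into a fresh result list, judged items first.
    data = json.loads(NOTES.read_text()) if NOTES.exists() else {}
    for r in results:
        r["my_judgment"] = data.get(r["url"])  # None if never seen before
    return sorted(results, key=lambda r: r["my_judgment"] is None)

save_judgment("https://example.com/paper.pdf", "keep", "best source so far")
print(annotate_results([{"url": "https://example.com/paper.pdf"},
                        {"url": "https://example.com/cruft"}]))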

Try that with Google, Bing or other general search appliance.

This won’t be an end of 2015 project, mostly because I am trying to learn a print dictionary layout from the 19th century for representation in granular markup and other tasks are at hand.

However, in early 2016 I will grab the Nutch 1.11 release and see if I can put it into daily use. More on that in 2016.

BTW, what projects are you going to be pursuing in 2016?

October 7, 2015

Search Correctness? [Right to be Forgotten, Hiding in Plain Sight]

Filed under: Search Engines,Searching — Patrick Durusau @ 9:32 pm

Everyone has an opinion about “political correctness,” but did you know that Google has “search correctness?”

This morning I was attempting, unsuccessfully, to search for:

fetch:xml

Here is a screen shot of my results:

[screenshot: Google results for fetch:xml]

I happen to know that “fetch:xml” is an actual string in Google indexed documents because it was a Google search that found the page (using other search terms) where that string exists!

I wanted to search beyond that page for other instances of “fetch:xml.” Surprise, surprise.

Prior to posting this, as a sanity check, I searched for “dc:title.” And got hits!? What’s going on?

If you look at Metatags.org, you will find them using:

DC.title, DC.creator, etc., replacing the “:” (colon) with a “.” (period).

That solution is a non-starter for me because I have no control over how other people report fetch:xml. One assumes they will use the same string as appears in the documentation.

As annoying as this is, perhaps it is the solution to the EU’s right to be forgotten problem.

People who want to be forgotten can change their names to punctuation that Google does not index.

Once their names are entirely punctuation, these shy individuals will never be found in a Google search.

Works across the globe with no burden on search engine operators.

September 23, 2015

SymbolHound

Filed under: Search Engines — Patrick Durusau @ 8:01 pm

SymbolHound

From the about page:

SymbolHound is a search engine that doesn’t ignore special characters. This means you can easily search for symbols like &, %, and π. We hope SymbolHound will help programmers find information about their chosen languages and frameworks more easily.

SymbolHound was started by David Crane and Thomas Feldtmose while students at the University of Connecticut. Another project by them is Toned Ear, a website for ear training.

I first saw SymbolHound mentioned in a discussion of delimiter options for a future XQuery feature.

For syntax drafting you need to have SymbolHound on your toolbar, not just bookmarked.

August 20, 2015

Censorship of Google Spreads to the UK

Filed under: Censorship,Search Engines — Patrick Durusau @ 7:33 pm

Google ordered to remove links to ‘right to be forgotten’ removal stories by Samuel Gibbs.

From the post:

Google has been ordered by the Information Commissioner’s office to remove nine links to current news stories about older reports which themselves were removed from search results under the ‘right to be forgotten’ ruling.

The search engine had previously removed links relating to a 10-year-old criminal offence by an individual after requests made under the right to be forgotten ruling. Removal of those links from Google’s search results for the claimant’s name spurred new news posts detailing the removals, which were then indexed by Google’s search engine.

Google refused to remove links to these later news posts, which included details of the original criminal offence, despite them forming part of search results for the claimant’s name, arguing that they are an essential part of a recent news story and in the public interest.

Google now has 35 days from the 18 August to remove the links from its search results for the claimant’s name. Google has the right to appeal to the General Regulatory Chamber against the notice.

It is spectacularly sad that this wasn’t the gnomes that run the EU bureaucracy, looking for something pointless to occupy their time and the time of others.

No, this was the Information Commissioner’s Office:

The UK’s independent authority set up to uphold information rights in the public interest, promoting openness by public bodies and data privacy for individuals.

Despite this being story of public interest and conceding that the public has an interest in finding stories about delisted searches:

27. Journalistic context — The Commissioner accepts that the search results in this case relate to journalistic content. Further, the Commissioner does not dispute that journalistic content relating to decisions to delist search results may be newsworthy and in the public interest. However, that interest can be adequately and properly met without a search made on the basis of the complainant’s name providing links to articles which reveal information about the complainant’s spent conviction.

The decision goes on to give Google 35 days from the 18th of August to delist websites which appear in search results on the basis of a censored name. And of course, the links are censored as well.

Despite having failed to fix the StageFright vulnerability which impacts 950 million Android users, the Information Commissioner’s Office wants to fine-tune the search results for a given name to exclude particular websites.

In the not too distant future, the search results displayed in Google will represent a vetting by regimes ranging from the most oppressive in the world to the silliest.

Google should not appeal this decision but simply ignore it.

It is an illegal and illegitimate intrusion both on the public’s right to search by any means or manner it chooses and Google’s right to truthfully report the results of searches.

July 10, 2015

American Right To Be A Google Censor?

Filed under: Censorship,Search Engines — Patrick Durusau @ 9:12 am

The “right to be forgotten” intellectual confusion has spread from the EU to the United States. John Zorabedian reports in: Do Americans have the same right as Europeans to be “forgotten” by Google? that Consumer Watchdog has filed a complaint with the FTC, seeking the mis-named “right to be forgotten” for Americans.

The “right to be forgotten” is deeply problematic for many reasons, among which are:

  1. If enforced, the offending link is removed from Google’s search results. The original and presumably offending source material persists. At best, the right is: “a right to not be found in Google search results.”
  2. As “a right to not be found in Google search results,” it is a remarkably limited right, since it works only in jurisdictions that establish that right.
  3. As “a right to not be found in Google search results,” it could lead to varying results as rules to be “forgotten” vary from one jurisdiction to another.
  4. As “a right to not be found in Google search results,” if it is given extra-territorial reach, would lead to world-wide censorship of Google search results. (The EU may be concerned with the sensitivities of Balkan war criminals but many outside the EU are not.)
  5. As “a right to not be found in Google search results,” it is on its face limited to Google, opening up the marketplace for sites that remember forgotten results and plugins that supplement Google search results with forgotten results.
  6. As “a right to not be found in Google search results,” it imposes an administrative overhead on Google that is not imposed on other search engines. Not to mention additional judicial proceedings if denial of a request by Google leads to litigation to force removal of materials from a Google search result.

At this point, the “complaint” of Consumer Watchdog isn’t anything more than a letter to the FTC. It appears nowhere in official FTC listings. Hopefully it will stay that way.

July 7, 2015

Google search poisoning – old dogs learn new tricks

Filed under: Search Analytics,Search Engines,Searching — Patrick Durusau @ 12:24 pm

Google search poisoning – old dogs learn new tricks by Dmitry Samosseiko.

From the post:

These days, every company knows that having its website appear at the top of Google’s results for relevant keyword searches makes a big difference in traffic and helps the business. Numerous search engine optimization (SEO) techniques have existed for years and provided marketers with ways to climb up the PageRank ladder.

In a nutshell, to be popular with Google, your website has to provide content relevant to specific search keywords and also to be linked to by a high number of reputable and relevant sites. (These act as recommendations, and are rather confusingly known as “back links,” even though it’s not your site that is doing the linking.)

Google’s algorithms are much more complex than this simple description, but most of the optimization techniques still revolve around those two goals. Many of the optimization techniques that are being used are legitimate, ethical and approved by Google and other search providers. But there are also other, and at times more effective, tricks that rely on various forms of internet abuse, with attempts to fool Google’s algorithms through forgery, spam and even hacking.

One of the techniques used to mislead Google’s page indexer is known as cloaking. A few days ago, we identified what we believe is a new type of cloaking that appears to work very well in bypassing Google’s defense algorithms.

Dmitry reports that Google was notified of this new form of cloaking, so it may not work for much longer.

I first read about this in Notes from SophosLabs: Poisoning Google search results and getting away with it by Paul Ducklin.

I’m not sure I would characterize this as “poisoning Google search.” It alters a Google search result, to be sure, but “poisoning” implies that standard Google search results represent some “standard” of search results. Google search results are the outcome of undisclosed algorithms run on undisclosed content, subject to undisclosed processing of the resulting scores, and output after still more undisclosed processing of the results.

Just putting it into large containers, I see four large boxes of undisclosed algorithms and content, all of which impact the results presented as Google Search results. Are Google Search results the standard output from four or more undisclosed processing steps of unknown complexity?

That doesn’t sound like much of a standard to me.

You?

June 26, 2015

DuckDuckGo search traffic soars 600% post-Snowden

Filed under: Privacy,Search Engines,Searching — Patrick Durusau @ 7:42 pm

DuckDuckGo search traffic soars 600% post-Snowden by Lee Munson.

From the post:

When Gabriel Weinberg launched a new search engine in 2008 I doubt even he thought it would gain any traction in an online world dominated by Google.

Now, seven years on, Philadelphia-based startup DuckDuckGo – a search engine that launched with a promise to respect user privacy – has seen a massive increase in traffic, thanks largely to ex-NSA contractor Edward Snowden’s revelations.

Since Snowden began dumping documents two years ago, DuckDuckGo has seen a 600% increase in traffic (but not in China – just like its larger brethren, it’s blocked there), thanks largely to its unique selling point of not recording any information about its users or their previous searches.

Such a huge rise in traffic means DuckDuckGo now handles around 3 billion searches per year.

DuckDuckGo does not track its users. Instead, it makes money by displaying ads based on key words from your search string.

Hmmm, what if instead of key words from your search string, you pre-qualified yourself for ads?

Say for example I have a topic map fragment that pre-qualifies me for new books on computer science, bread baking, and waxed dental floss. When I use a search site, it uses those “topics” or key words to display ads to me.

That avoids displaying to me ads for new cars (don’t own one, don’t want one), hair replacement ads (not interested) and ski resorts (don’t ski).

Advertisers benefit because their ads are displayed to people who have qualified themselves as interested in their products. I don’t know what the difference in click-through rate would be but I suspect it would be substantial.
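A toy sketch of the matching, with invented topics and inventory:

# The searcher volunteers a topic fragment; the engine intersects it with
# ad inventory instead of mining the search string.
my_topics = {"computer science books", "bread baking", "waxed dental floss"}

ad_inventory = {
    "new-car-dealer": {"new cars"},
    "cs-press":       {"computer science books"},
    "flour-co":       {"bread baking", "cake baking"},
    "ski-resort":     {"ski resorts"},
}

qualified = [ad for ad, topics in ad_inventory.items() if topics & my_topics]
print(qualified)  # ['cs-press', 'flour-co'] -- no cars, no ski resorts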

Thoughts?

June 10, 2015

BASE – Bielefeld Academic Search Engine

Filed under: OAI,Search Engines — Patrick Durusau @ 1:48 pm

BASE – Bielefeld Academic Search Engine

From the post:

BASE is one of the world’s most voluminous search engines especially for academic open access web resources. BASE is operated by Bielefeld University Library.

As the open access movement grows and prospers, more and more repository servers come into being which use the “Open Archives Initiative Protocol for Metadata Harvesting” (OAI-PMH) for providing their contents. BASE collects, normalises, and indexes these data. BASE provides more than 70 million documents from more than 3,000 sources. You can access the full texts of about 70% of the indexed documents. The index is continuously enhanced by integrating further OAI sources as well as local sources. Our OAI-PMH Blog communicates information related to harvesting and aggregating activities performed for BASE.
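If you have never harvested an OAI-PMH source yourself, the third-party Python client Sickle makes the protocol almost invisible. The endpoint below is arXiv’s, as an example; any OAI-PMH base URL works:

from sickle import Sickle  # pip install sickle

# Connect to an OAI-PMH endpoint and pull records in Dublin Core format.
sickle = Sickle("https://export.arxiv.org/oai2")
records = sickle.ListRecords(metadataPrefix="oai_dc")

# Print titles from the first five harvested records.
for _, record in zip(range(5), records):
    print(record.metadata.get("title"))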

One feature of the search interface that is particularly appealing is the ability to “boost” open access documents, request verbatim search, request additional word forms, and to invoke multilingual synonyms (Eurovoc Thesaurus).

I first saw this in a tweet by Amanda French.

June 2, 2015

Relevant Search

Filed under: ElasticSearch,Relevance,Search Engines,Solr — Patrick Durusau @ 3:21 pm

Relevant Search – With examples using Elasticsearch and Solr by Doug Turnbull and John Berryman.

From the webpage:

Users expect search to be simple: They enter a few terms and expect perfectly-organized, relevant results instantly. But behind this simple user experience, complex machinery is at work. Whether using Solr, Elasticsearch, or another search technology, the solution is never one size fits all. Returning the right search results requires conveying domain knowledge and business rules in the search engine’s data structures, text analytics, and results ranking capabilities.

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. Relevant Search walks through several real-world problems using a cohesive philosophy that combines text analysis, query building, and score shaping to express business ranking rules to the search engine. It outlines how to guide the engineering process by monitoring search user behavior and shifting the enterprise to a search-first culture focused on humans, not computers. You’ll see how the search engine provides a deeply pluggable platform for integrating search ranking with machine learning, ontologies, personalization, domain-specific expertise, and other enriching sources.

  • Creating a foundation for Lucene-based search (Solr, Elasticsearch) relevance internals
  • Bridging the field of Information Retrieval and real-world search problems
  • Building your toolbelt for relevance work
  • Solving search ranking problems by combining text analysis, query building, and score shaping
  • Providing users relevance feedback so that they can better interact with search
  • Integrating test-driven relevance techniques based on A/B testing and content expertise
  • Exploring advanced relevance solutions through custom plug-ins and machine learning

Now imagine relevancy searching where a topic map contains multiple subject identifications for a single subject, from different perspectives.

Relevant Search is in early release but the sooner you participate, the fewer errata there will be in the final version.

May 18, 2015

FreeSearch

Filed under: Computer Science,Search Behavior,Search Engines,Searching — Patrick Durusau @ 5:52 pm

FreeSearch

From the “about” page:

The FreeSearch project is a search system on top of DBLP data provided by Michael Ley. FreeSearch is a joint project of the L3S Research Center and iSearch IT Solutions GmbH.

In this project we develop new methods for simple literature search that works on any catalogs, without requiring in-depth knowledge of the metadata schema. The system helps users proactively and unobtrusively by guessing at each step what the user’s real information need is and providing precise suggestions.

A more detailed description of the system can be found in this publication: FreeSearch – Literature Search in a Natural Way.

You can choose to search across:

DBLP (4,552,889 documents)

TIBKat (2,079,012 documents)

CiteSeer (1,910,493 documents)

BibSonomy (448,166 documents)

Enjoy!

April 19, 2015

DARPA: MEMEX (Domain-Specific Search) Drops!

Filed under: DARPA,Search Engines,Searching,Tor — Patrick Durusau @ 12:49 pm

The DARPA MEMEX project is now listed on its Open Catalog page!

Forty (40) separate components listed by team, project, category, link to code, description and license. Each sortable of course.

No doubt DARPA has held back some of its best work but looking over the descriptions, there are no boojums or quantum leaps beyond current discussions in search technology. How far you can push the released work beyond its current state is an exercise for the reader.

Machine learning is mentioned in the descriptions for DeepDive, Formasaurus and SourcePin. No explicit mention of deep learning, at least in the descriptions.

If you prefer to not visit the DARPA site, I have gleaned the essential information (project, link to code, description) into the following list:

  • ACHE: ACHE is a focused crawler. Users can customize the crawler to search for different topics or objects on the Web. (Java)
  • Aperture Tile-Based Visual Analytics: New tools for raw data characterization of ‘big data’ are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points. (JavaScript/Java)
  • ArrayFire: ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array-based function set makes parallel programming simple. ArrayFire’s multiple backends (CUDA, OpenCL, and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving users valuable time and lowering development costs. (C, C++, Python, Fortran, Java)
  • Autologin: AutoLogin is a utility that allows a web crawler to start from any given page of a website (for example the home page) and attempt to find the login page, where the spider can then log in with a set of valid, user-provided credentials to conduct a deep crawl of a site to which the user already has legitimate access. AutoLogin can be used as a library or as a service. (Python)
  • CubeTest: Official evaluation metric used for the TREC Dynamic Domain Track. It is a multiple-dimensional metric that measures the effectiveness of completing a complex, task-based search process. (Perl)
  • Data Microscopes: Data Microscopes is a collection of robust, validated Bayesian nonparametric models for discovering structure in data. Models for tabular, relational, text, and time-series data can accommodate multiple data types, including categorical, real-valued, binary, and spatial data. Inference and visualization of results respects the underlying uncertainty in the data, allowing domain experts to feel confident in the quality of the answers they receive. (Python, C++)
  • DataWake: The Datawake project consists of various server and database technologies that aggregate user browsing data via a plug-in using domain-specific searches. This captured, or extracted, data is organized into browse paths and elements of interest. This information can be shared or expanded amongst teams of individuals. Elements of interest which are extracted either automatically, or manually by the user, are given weighted values. (Python/Java/Scala/Clojure/JavaScript)
  • DeepDive: DeepDive is a new type of knowledge base construction system that enables developers to analyze data on a deeper level than ever before. Many applications have been built using DeepDive to extract data from millions of documents, Web pages, PDFs, tables, and figures. DeepDive is a trained system, which means that it uses machine-learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive can deal with noisy and imprecise data by producing calibrated probabilities for every assertion it makes. DeepDive offers a scalable, high-performance learning engine. (SQL, Python, C++)
  • DIG: DIG is a visual analysis tool based on a faceted search engine that enables rapid, interactive exploration of large data sets. Users refine their queries by entering search terms or selecting values from lists of aggregated attributes. DIG can be quickly configured for a new domain through simple configuration. (JavaScript)
  • Dossier Stack: Dossier Stack provides a framework of library components for building active search applications that learn what users want by capturing their actions as truth data. The framework’s web services and JavaScript client libraries enable applications to efficiently capture user actions, such as organizing content into folders, and allow back end algorithms to train classifiers and ranking algorithms to recommend content based on those user actions. (Python/JavaScript/Java)
  • Dumpling: Dumpling implements a novel dynamic search engine which refines search results on the fly. Dumpling utilizes the Winwin algorithm and the Query Change retrieval Model (QCM) to infer the user’s state and tailor search results accordingly. Dumpling provides a friendly user interface for users to compare the static and dynamic results. (Java, JavaScript, HTML, CSS)
  • FacetSpace: FacetSpace allows the investigation of large data sets based on the extraction and manipulation of relevant facets. These facets may be almost any consistent piece of information that can be extracted from the dataset: names, locations, prices, etc… (JavaScript)
  • Formasaurus: Formasaurus is a Python package that tells users the type of an HTML form: is it a login, search, registration, password recovery, join mailing list, contact form or something else. Under the hood it uses machine learning. (Python)
  • Frontera: Frontera (formerly Crawl Frontier) is used as part of a web crawler, it can store URLs and prioritize what to visit next. (Python)
  • HG Profiler: HG Profiler is a tool that allows users to take a list of entities from a particular source and look for those same entities across a pre-defined list of other sources. (Python)
  • Hidden Service Forum Spider: An interactive web forum analysis tool that operates over Tor hidden services. This tool is capable of passive forum data capture and posting dialog at random or user-specifiable intervals. (Python)
  • HSProbe (The Tor Hidden Service Prober): HSProbe is a python multi-threaded STEM-based application designed to interrogate the status of Tor hidden services (HSs) and to extract hidden service content. It is an HS-protocol savvy crawler, that uses protocol error codes to decide what to do when a hidden service is not reached. HSProbe tests whether specified Tor hidden services (.onion addresses) are listening on one of a range of pre-specified ports, and optionally, whether they are speaking over other specified protocols. As of this version, support for HTTP and HTTPS is implemented. Hsprobe takes as input a list of hidden services to be probed and generates as output a similar list of the results of each hidden service probed. (Python)
  • ImageCat: ImageCat analyses images and extracts their EXIF metadata and any text contained in the image via OCR. It can handle millions of images. (Python, Java)
  • ImageSpace: ImageSpace provides the ability to analyze and search through large numbers of images. These images may be text searched based on associated metadata and OCR text or a new image may be uploaded as a foundation for a search. (Python)
  • Karma: Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information by modelling it according to an ontology of their choice using a graphical user interface that automates much of the process. (Java, JavaScript)
  • LegisGATE: Demonstration application for running General Architecture Text Engineering over legislative resources. (Java)
  • Memex Explorer: Memex Explorer is a pluggable framework for domain specific crawls, search, and unified interface for Memex Tools. It includes the capability to add links to other web-based apps (not just Memex) and the capability to start, stop, and analyze web crawls using 2 different crawlers – ACHE and Nutch. (Python)
  • MITIE: Trainable named entity extractor (NER) and relation extractor. (C)
  • Omakase: Omakase provides a simple and flexible interface to share data, computations, and visualizations between a variety of user roles in both local and cloud environments. (Python, Clojure)
  • pykafka: pykafka is a Python driver for the Apache Kafka messaging system. It enables Python programmers to publish data to Kafka topics and subscribe to existing Kafka topics. It includes a pure-Python implementation as well as an optional C driver for increased performance. It is the only Python driver to have feature parity with the official Scala driver, supporting both high-level and low-level APIs, including balanced consumer groups for high-scale uses. (Python)
  • Scrapy Cluster: Scrapy Cluster is a scalable, distributed web crawling cluster based on Scrapy and coordinated via Kafka and Redis. It provides a framework for intelligent distributed throttling as well as the ability to conduct time-limited web crawls. (Python)
  • Scrapy-Dockerhub: Scrapy-Dockerhub is a deployment setup for Scrapy spiders that packages the spider and all dependencies into a Docker container, which is then managed by a Fabric command line utility. With this setup, users can run spiders seamlessly on any server, without the need for Scrapyd which typically handles the spider management. With Scrapy-Dockerhub, users issue one command to deploy spider with all dependencies to the server and second command to run it. There are also commands for viewing jobs, logs, etc. (Python)
  • Shadow: Shadow is an open-source network simulator/emulator hybrid that runs real applications like Tor and Bitcoin over a simulated Internet topology. It is light-weight, efficient, scalable, parallelized, controllable, deterministic, accurate, and modular. (C)
  • SMQTK: Kitware’s Social Multimedia Query Toolkit (SMQTK) is an open-source service for ingesting images and video from social media (e.g. YouTube, Twitter), computing content-based features, indexing the media based on the content descriptors, querying for similar content, and building user-defined searches via an interactive query refinement (IQR) process. (Python)
  • SourcePin: SourcePin is a tool to assist users in discovering websites that contain content they are interested in for a particular topic, or domain. Unlike a search engine, SourcePin allows a non-technical user to leverage the power of an advanced automated smart web crawling system to generate significantly more results than the manual process typically does, in significantly less time. The User Interface of SourcePin allows users to quickly scan through hundreds or thousands of representative images to quickly find the websites they are most interested in. SourcePin also has a scoring system which takes feedback from the user on which websites are interesting and, using machine learning, assigns a score to the other crawl results based on how interesting they are likely to be for the user. The roadmap for SourcePin includes integration with other tools and a capability for users to actually extract relevant information from the crawl results. (Python, JavaScript)
  • Splash: Lightweight, scriptable browser as a service with an HTTP API. (Python)
  • streamparse: streamparse runs Python code against real-time streams of data. It allows users to spin up small clusters of stream processing machines locally during development. It also allows remote management of stream processing clusters that are running Apache Storm. It includes a Python module implementing the Storm multi-lang protocol; a command-line tool for managing local development, projects, and clusters; and an API for writing data processing topologies easily. (Python, Clojure)
  • TellFinder: TellFinder provides efficient visual analytics to automatically characterize and organize publicly available Internet data. Compared to standard web search engines, TellFinder enables users to research case-related data in significantly less time. Reviewing TellFinder’s automatically characterized groups also allows users to understand temporal patterns, relationships and aggregate behavior. The techniques are applicable to various domains. (JavaScript, Java)
  • Text.jl: Text.jl provided numerous tools for text processing optimized for the Julia language. Functionality supported include algorithms for feature extraction, text classification, and language identification. (Julia)
  • TJBatchExtractor: Regex based information extractor for online advertisements (Java).
  • Topic: This tool takes a set of text documents, filters by a given language, and then produces documents clustered by topic. The method used is Probabilistic Latent Semantic Analysis (PLSA). (Python)
  • Topic Space: Tool for visualization for topics in document collections. (Python)
  • Tor: The core software for using and participating in the Tor network. (C)
  • The Tor Path Simulator (TorPS): TorPS quickly simulates path selection in the Tor traffic-secure communications network. It is useful for experimental analysis of alternative route selection algorithms or changes to route selection parameters. (C++, Python, Bash)
  • TREC-DD Annotation: This Annotation Tool supports the annotation task in creating ground truth data for TREC Dynamic Domain Track. It adopts drag and drop approach for assessor to annotate passage-level relevance judgement. It also supports multiple ways of browsing and search in various domains of corpora used in TREC DD. (Python, JavaScript, HTML, CSS)

Beyond whatever use you find for the software, it is also important in terms of what capabilities are of interest to DARPA and by extension to those interested in militarized IT.

April 15, 2015

Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

Filed under: DARPA,Search Analytics,Search Engines,Search Requirements,Uncategorized — Patrick Durusau @ 2:20 pm

Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

Is DARPA’s Memex search engine a Google-killer? by Mark Stockley.

A couple of “while you wait” pieces to read while you expect part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

Fox-Brewster has video of a part of the system that:

It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.

If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

Entirely manual indexing is already impossible, and you have experienced the shortcomings of page ranking.

Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

Imagine an electronic The Art of Computer Programming with several human experts per chapter, assisted by deep searching, updating references and the text on an ongoing basis, so readers don’t have to weed through all the re-inventions of particular algorithms across numerous computer and math journals.

Or perhaps a more automated search of news reports so the earliest/most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value; more often it just adds content.

BTW, the focus on the capabilities of the search engine, as opposed to just the analysis of those results, is most welcome.

See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

Looking forward to Friday or Monday!

April 1, 2015

Full-Text Search in Javascript (Part 1: Relevance Scoring)

Filed under: Javascript,Lucene,Search Engines,Searching — Patrick Durusau @ 7:47 pm

Full-Text Search in Javascript (Part 1: Relevance Scoring) by Barak Kanber.

From the post:

Full-text search, unlike most of the topics in this machine learning series, is a problem that most web developers have encountered at some point in their daily work. A client asks you to put a search field somewhere, and you write some SQL along the lines of WHERE title LIKE %:query%. It’s convincing at first, but then a few days later the client calls you and claims that “search is broken!”

Of course, your search isn’t broken, it’s just not doing what the client wants. Regular web users don’t really understand the concept of exact matches, so your search quality ends up being poor. You decide you need to use full-text search. With some MySQL fidgeting you’re able to set up a FULLTEXT index and use a more evolved syntax, the “MATCH() … AGAINST()” query.

Great! Problem solved. For smallish databases.

As you hit the hundreds of thousands of records, you notice that your database is sluggish. MySQL just isn’t great at full-text search. So you grab ElasticSearch, refactor your code a bit, and deploy a Lucene-driven full-text search cluster that works wonders. It’s fast and the quality of results is great.

Which leads you to ask: what the heck is Lucene doing so right?

This article (on TF-IDF, Okapi BM-25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.

Illustration of search engine concepts in Javascript with code for download. You can tinker to your heart’s delight.
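Kanber works in Javascript; the core TF-IDF idea he starts from fits in a few lines of Python (one common smoothing variant shown):

import math

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs and search engines"]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: rarer terms score higher corpus-wide.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

for i, doc in enumerate(tokenized):
    print(i, round(tf_idf("cat", doc, tokenized), 4))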

Enjoy!

PS: Part 2 is promised in the next “several” weeks. Will be watching for it.

March 4, 2015

The Ultimate Google Algorithm Cheat Sheet

Filed under: Algorithms,Search Engines — Patrick Durusau @ 1:07 pm

The Ultimate Google Algorithm Cheat Sheet by Neil Patel.

From the post:

Have you ever wondered why Google keeps pumping out algorithm updates?

Do you understand what Google algorithm changes are all about?

No SEO or content marketer can accurately predict what any future update will look like. Even some Google employees don’t understand everything that’s happening with the web’s most dominant search engine.

All this confusion presents a problem for savvy site owners: Since there are about 200 Google ranking factors, which ones should you focus your attention on after each major Google algorithm update?

Google has issued four major algorithm updates, named (in chronological order) Panda, Penguin, Hummingbird and Pigeon. In between these major updates, Google engineers also made some algorithm tweaks that weren’t that heavily publicized, but still may have had an impact on your website’s rankings in search results.

A very detailed post describing the history of changes to the Google algorithm, the impact of those changes and suggestions for how you can improve your search rankings.

The post alone is worth your attention but so are the resources that Neil relies upon, such as:

200 Google ranking factors, which really is a list of 200 Google ranking factors.

I first saw this in a tweet by bloggerink.

February 23, 2015

Apache Solr 5.0 Highlights

Filed under: Search Engines,Solr — Patrick Durusau @ 9:04 pm

Apache Solr 5.0 Highlights by Anshum Gupta.

From the post:

Usability improvements

A lot of effort has gone into making Solr more usable, mostly along the lines of introducing APIs and hiding implementation details for users who don’t need to know. Solr 4.10 was released with scripts to start, stop and restart Solr instance, 5.0 takes it further in terms of what can be done with those. The scripts now for instance, copy a configset on collection creation so that the original isn’t changed. There’s also a script to index documents as well as the ability to delete collections in Solr. As an example, this is all you need to do to start SolrCloud, index lucidworks.com, browse through what’s been indexed, and clean up the collection.

bin/solr start -e cloud -noprompt                      # start the SolrCloud example without interactive prompts
bin/post -c gettingstarted http://lucidworks.com       # crawl and index the page into the gettingstarted collection
open http://localhost:8983/solr/gettingstarted/browse  # browse what was just indexed
bin/solr delete -c gettingstarted                      # clean up: delete the collection

Another important thing to note for new users is that Solr no longer has the default collection1 and instead comes with multiple example config-sets and data.

That is just a tiny sample of Gupta’s post, which itself covers only the highlights of Apache Solr 5.0.

Download Apache Solr 5.0 to experience the improvements for yourself!

Download the improved Solr 5.0 Reference Manual as well!

February 22, 2015

Apache Solr 5.0.0 and Reference Guide for 5.0 available

Filed under: Search Engines,Solr — Patrick Durusau @ 7:24 pm

Apache Solr 5.0.0 and Reference Guide for 5.0 available

For the impatient:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Changes.txt

https://s.apache.org/Solr-Ref-Guide-PDF

From the post:

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world’s largest internet sites.

Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

  • Usability improvements that include improved bin scripts and new and restructured examples.
  • Scripts to support installing and running Solr as a service on Linux.
  • Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations:

    • LocalStatsCache: local document stats
    • ExactStatsCache: one-time-use aggregation
    • ExactSharedStatsCache: stats shared across requests
    • LRUStatsCache: stats shared in an LRU cache across requests
  • Solr no longer ships as a WAR file; it is instead a standalone, downloadable application.
  • SolrJ now has first class support for Collections API.
  • Implicit registration of replication, get, and admin handlers.
  • Config API that supports paramsets for easily configuring Solr parameters and fields. The API also supports managing pre-existing request handlers and editing common solrconfig.xml settings via an overlay (see the sketch below).
  • API for managing blobs allows uploading request handler jars and registering them via config API.
  • BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties.
  • There’s now an option to not shuffle the nodeSet provided during collection creation.
  • Option to configure bandwidth usage by Replication handler to prevent it from using up all the bandwidth.
  • Splitting of clusterstate into per-collection state enables scalability improvements in SolrCloud; this is also the default format for new collections going forward.
  • timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
  • pivot.facet results can now include nested stats.field results constrained by those pivots.
  • stats.field can be used to generate stats over the results of arbitrary numeric functions.
    It also allows requesting statistics for pivot facets using tags.
  • A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
  • Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
  • MoreLikeThis query parser allows requesting documents similar to an existing document and also works in SolrCloud mode.
  • Logging improvements:

    • Transaction log replay status is now logged
    • Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release. Also available is the Solr Reference Guide for Solr 5.0. This 535-page PDF serves as the definitive user’s manual for Solr 5.0. It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF
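
As a taste of the new Config API mentioned in the highlights, here is a minimal sketch in Python. The collection name and the property being set are illustrative only, and it assumes a local Solr 5.0 node such as the cloud example from the previous post.

# Minimal sketch of the Solr 5 Config API; the collection name and the
# property being set are illustrative.
import json
import urllib.request

base = "http://localhost:8983/solr/gettingstarted/config"

# Set a common property; the change is written to configoverlay.json,
# leaving solrconfig.xml itself untouched.
payload = {"set-property": {"updateHandler.autoCommit.maxTime": 15000}}
req = urllib.request.Request(base, data=json.dumps(payload).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode("utf-8"))

# Read the effective (merged) configuration back.
with urllib.request.urlopen(base) as resp:
    print(resp.read().decode("utf-8")[:400])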

This is the beginning of a great week!

Enjoy!

February 19, 2015

Solr 5.0 RC3!

Filed under: Search Engines,Solr — Patrick Durusau @ 7:49 pm

Yes, Solr 5.0 RC3 has dropped!

The latest-release link, http://lucene.apache.org/solr/mirrors-solr-latest-redir.html, will toss you out at the 4.10.3 release.

Let it take you to the suggested mirror, then move up one level and down into the 5.0 directory.

Download.

Enjoy!

PS: For Lucene, follow the same directions, but go to the Lucene download page.

February 13, 2015

Solr 5.0 Will See Another RC – But Docs Are Available

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 12:08 pm

I saw a tweet from Anshum Gupta today saying:

Though the vote passed, seems like there’s need for another RC for #Apache #Lucene / #Solr 5.0. Hopefully we’d be third time lucky.

To brighten your weekend prospects, the Apache Solr Reference Guide for Solr 5.0 is available.

With another Solr RC on the horizon, now would be a good time to spend some time with the reference guide, both to learn the new features and to smooth out any infelicities in the documentation.

February 11, 2015

Darpa Is Developing a Search Engine for the Dark Web

Filed under: Dark Data,DARPA,Search Engines — Patrick Durusau @ 7:19 pm

Darpa Is Developing a Search Engine for the Dark Web by Kim Zetter.

From the post:

A new search engine being developed by Darpa aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity.

The project, dubbed Memex, has been in the works for a year and is being developed by 17 different contractor teams who are working with the military’s Defense Advanced Research Projects Agency. Google and Bing, with search results influenced by popularity and ranking, are only able to capture approximately five percent of the internet. The goal of Memex is to build a better map of more internet content.

Reading how Darpa is building yet another bigger dirt pile, I was reminded of Rick Searle saying:


Rather than think of Big Data as somehow providing us with a picture of reality, “naturally emerging” as Mayer-Schönberger quoted above suggested we should start to view it as a way to easily and cheaply give us a metric for the potential validity of a hypothesis. And it’s not only the first step that continues to be guided by old fashioned science rather than computer driven numerology but the remaining steps as well, a positive signal followed up by actual scientists and other researchers doing such now rusting skills as actual experiments and building theories to explain their results. Big Data, if done right, won’t end up making science a form of information promising, but will instead be used as the primary tool for keeping scientists from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old school human intelligence even if it now needs to be empowered by new digital tools. Rather than Big Data being used to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted and based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence and towards more targeted surveillance rather than the mass data grab symbolized by Bluffdale may be a reality forced on the NSA et al by events. In part due to the Snowden revelations terrorist and criminal networks have already abandoned the non-secure public networks which the rest of us use. Mass surveillance has lost its raison d’etre.

In particular because the project is designed to automatically discover “relationships”:


But the creators of Memex don’t want just to index content on previously undiscovered sites. They also want to use automated methods to analyze that content in order to uncover hidden relationships that would be useful to law enforcement, the military, and even the private sector. The Memex project currently has eight partners involved in testing and deploying prototypes. White won’t say who the partners are but they plan to test the system around various subject areas or domains. The first domain they targeted were sites that appear to be involved in human trafficking. But the same technique could be applied to tracking Ebola outbreaks or “any domain where there is a flood of online content, where you’re not going to get it if you do queries one at a time and one link at a time,” he says.

I for one am very sure the new system (I refuse to sully the historical name by associating it with this doomed DARPA project) will find relationships, many relationships in fact. Too many relationships for any organization, no matter how large, to sort the relevant from the irrelevant.

If you want to segment the task, you could say that data mining is charged with finding relationships.

However, the next step, data analysis, is to evaluate the evidence for the relationships found in the preceding step.

The step after that is to determine what, if anything, the discovered relationships mean for the question at hand.

In all but the simplest of cases, there will be even more steps than the ones I listed. All of which must occur before you have extracted reliable intelligence from the data mining exercise.
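
To make that division of labor concrete, here is a toy sketch in Python: co-occurrence counting “finds” candidate relationships, and a crude significance score (pointwise mutual information, my illustrative choice, not anything attributed to Memex) is only the barest beginning of evaluating the evidence for them.

# Toy sketch: "finding relationships" vs. "evaluating the evidence for them".
# PMI is an illustrative choice here, not the Memex methodology.
import math
from collections import Counter
from itertools import combinations

# A tiny corpus of "documents", each the set of entities seen together.
docs = [
    {"alice", "bob", "hotel"},
    {"alice", "bob", "flight"},
    {"carol", "hotel"},
    {"alice", "bob", "carol"},
]
N = len(docs)
single = Counter(t for d in docs for t in d)
pairs = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

# Step 1 (mining): every co-occurring pair is a candidate relationship.
# Step 2 (evaluation): score candidates; high PMI means a pair co-occurs
# more often than its members' individual frequencies would predict.
for pair, count in pairs.items():
    a, b = tuple(pair)
    pmi = math.log((count / N) / ((single[a] / N) * (single[b] / N)))
    print(sorted(pair), round(pmi, 2))
# Step 3 (interpretation) remains human work: PMI says nothing about why two
# entities co-occur, or whether the relationship matters to the question at hand.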

Having data to search is a first step. Searching for and finding relationships in data is another step. But if that is where the search trail ends, you have just wasted another $10 to $20 million that could have gone for worthwhile intelligence gathering.
