Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 19, 2015

DARPA: MEMEX (Domain-Specific Search) Drops!

Filed under: DARPA,Search Engines,Searching,Tor — Patrick Durusau @ 12:49 pm

The DARPA MEMEX project is now listed on its Open Catalog page!

Forty (40) separate components are listed by team, project, category, link to code, description and license. Each column is sortable, of course.

No doubt DARPA has held back some of its best work but looking over the descriptions, there are no boojums or quantum leaps beyond current discussions in search technology. How far you can push the released work beyond its current state is an exercise for the reader.

Machine learning is mentioned in the descriptions for DeepDive, Formasaurus and SourcePin. No explicit mention of deep learning, at least in the descriptions.

If you prefer not to visit the DARPA site, I have gleaned the essential information (project, link to code, description) into the following list:

  • ACHE: ACHE is a focused crawler. Users can customize the crawler to search for different topics or objects on the Web. (Java)
  • Aperture Tile-Based Visual Analytics: New tools for raw data characterization of ‘big data’ are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points. (JavaScript/Java)
  • ArrayFire: ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array-based function set makes parallel programming simple. ArrayFire’s multiple backends (CUDA, OpenCL, and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving users valuable time and lowering development costs. (C, C++, Python, Fortran, Java)
  • Autologin: AutoLogin is a utility that allows a web crawler to start from any given page of a website (for example the home page) and attempt to find the login page, where the spider can then log in with a set of valid, user-provided credentials to conduct a deep crawl of a site to which the user already has legitimate access. AutoLogin can be used as a library or as a service. (Python)
  • CubeTest: Official evaluation metric used for the TREC Dynamic Domain Track. It is a multi-dimensional metric that measures the effectiveness of completing a complex, task-based search process. (Perl)
  • Data Microscopes: Data Microscopes is a collection of robust, validated Bayesian nonparametric models for discovering structure in data. Models for tabular, relational, text, and time-series data can accommodate multiple data types, including categorical, real-valued, binary, and spatial data. Inference and visualization of results respects the underlying uncertainty in the data, allowing domain experts to feel confident in the quality of the answers they receive. (Python, C++)
  • DataWake: The Datawake project consists of various server and database technologies that aggregate user browsing data via a plug-in using domain-specific searches. This captured, or extracted, data is organized into browse paths and elements of interest. This information can be shared or expanded amongst teams of individuals. Elements of interest which are extracted either automatically, or manually by the user, are given weighted values. (Python/Java/Scala/Clojure/JavaScript)
  • DeepDive: DeepDive is a new type of knowledge base construction system that enables developers to analyze data on a deeper level than ever before. Many applications have been built using DeepDive to extract data from millions of documents, Web pages, PDFs, tables, and figures. DeepDive is a trained system, which means that it uses machine-learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive can deal with noisy and imprecise data by producing calibrated probabilities for every assertion it makes. DeepDive offers a scalable, high-performance learning engine. (SQL, Python, C++)
  • DIG: DIG is a visual analysis tool based on a faceted search engine that enables rapid, interactive exploration of large data sets. Users refine their queries by entering search terms or selecting values from lists of aggregated attributes. DIG can be quickly configured for a new domain through simple configuration. (JavaScript)
  • Dossier Stack: Dossier Stack provides a framework of library components for building active search applications that learn what users want by capturing their actions as truth data. The framework’s web services and JavaScript client libraries enable applications to efficiently capture user actions, such as organizing content into folders, and allow back-end algorithms to train classifiers and ranking algorithms to recommend content based on those user actions. (Python/JavaScript/Java)
  • Dumpling: Dumpling implements a novel dynamic search engine which refines search results on the fly. Dumpling utilizes the Winwin algorithm and the Query Change Retrieval Model (QCM) to infer the user’s state and tailor search results accordingly. Dumpling provides a friendly user interface for users to compare the static and dynamic results. (Java, JavaScript, HTML, CSS)
  • FacetSpace: FacetSpace allows the investigation of large data sets based on the extraction and manipulation of relevant facets. These facets may be almost any consistent piece of information that can be extracted from the dataset: names, locations, prices, etc… (JavaScript)
  • Formasaurus: Formasaurus is a Python package that tells users the type of an HTML form: is it a login, search, registration, password recovery, join mailing list, contact form or something else. Under the hood it uses machine learning. (Python)
  • Frontera: Frontera (formerly Crawl Frontier) is used as part of a web crawler; it can store URLs and prioritize what to visit next. (Python)
  • HG Profiler: HG Profiler is a tool that allows users to take a list of entities from a particular source and look for those same entities across a pre-defined list of other sources. (Python)
  • Hidden Service Forum Spider: An interactive web forum analysis tool that operates over Tor hidden services. This tool is capable of passive forum data capture and posting dialog at random or user-specifiable intervals. (Python)
  • HSProbe (The Tor Hidden Service Prober): HSProbe is a Python multi-threaded Stem-based application designed to interrogate the status of Tor hidden services (HSs) and extract hidden service content. It is an HS-protocol-savvy crawler that uses protocol error codes to decide what to do when a hidden service is not reached. HSProbe tests whether specified Tor hidden services (.onion addresses) are listening on one of a range of pre-specified ports, and optionally, whether they are speaking over other specified protocols. As of this version, support for HTTP and HTTPS is implemented. HSProbe takes as input a list of hidden services to be probed and generates as output a similar list of the results of each hidden service probed. (Python)
  • ImageCat: ImageCat analyses images and extracts their EXIF metadata and any text contained in the image via OCR. It can handle millions of images. (Python, Java)
  • ImageSpace: ImageSpace provides the ability to analyze and search through large numbers of images. These images may be text searched based on associated metadata and OCR text or a new image may be uploaded as a foundation for a search. (Python)
  • Karma: Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information by modelling it according to an ontology of their choice using a graphical user interface that automates much of the process. (Java, JavaScript)
  • LegisGATE: Demonstration application for running General Architecture Text Engineering over legislative resources. (Java)
  • Memex Explorer: Memex Explorer is a pluggable framework for domain-specific crawls and search, with a unified interface for Memex tools. It includes the capability to add links to other web-based apps (not just Memex) and the capability to start, stop, and analyze web crawls using two different crawlers: ACHE and Nutch. (Python)
  • MITIE: Trainable named entity extractor (NER) and relation extractor. (C)
  • Omakase: Omakase provides a simple and flexible interface to share data, computations, and visualizations between a variety of user roles in both local and cloud environments. (Python, Clojure)
  • pykafka: pykafka is a Python driver for the Apache Kafka messaging system. It enables Python programmers to publish data to Kafka topics and subscribe to existing Kafka topics. It includes a pure-Python implementation as well as an optional C driver for increased performance. It is the only Python driver to have feature parity with the official Scala driver, supporting both high-level and low-level APIs, including balanced consumer groups for high-scale uses. (Python) A minimal usage sketch appears after this list.
  • Scrapy Cluster: Scrapy Cluster is a scalable, distributed web crawling cluster based on Scrapy and coordinated via Kafka and Redis. It provides a framework for intelligent distributed throttling as well as the ability to conduct time-limited web crawls. (Python)
  • Scrapy-Dockerhub: Scrapy-Dockerhub is a deployment setup for Scrapy spiders that packages the spider and all dependencies into a Docker container, which is then managed by a Fabric command line utility. With this setup, users can run spiders seamlessly on any server, without the need for Scrapyd, which typically handles the spider management. With Scrapy-Dockerhub, users issue one command to deploy a spider with all its dependencies to the server and a second command to run it. There are also commands for viewing jobs, logs, etc. (Python)
  • Shadow: Shadow is an open-source network simulator/emulator hybrid that runs real applications like Tor and Bitcoin over a simulated Internet topology. It is light-weight, efficient, scalable, parallelized, controllable, deterministic, accurate, and modular. (C)
  • SMQTK: Kitware’s Social Multimedia Query Toolkit (SMQTK) is an open-source service for ingesting images and video from social media (e.g. YouTube, Twitter), computing content-based features, indexing the media based on the content descriptors, querying for similar content, and building user-defined searches via an interactive query refinement (IQR) process. (Python)
  • SourcePin: SourcePin is a tool to assist users in discovering websites that contain content they are interested in for a particular topic, or domain. Unlike a search engine, SourcePin allows a non-technical user to leverage the power of an advanced automated smart web crawling system to generate significantly more results than the manual process typically does, in significantly less time. The user interface of SourcePin allows users to scan quickly through hundreds or thousands of representative images to find the websites they are most interested in. SourcePin also has a scoring system which takes feedback from the user on which websites are interesting and, using machine learning, assigns a score to the other crawl results based on how interesting they are likely to be for the user. The roadmap for SourcePin includes integration with other tools and a capability for users to actually extract relevant information from the crawl results. (Python, JavaScript)
  • Splash: Lightweight, scriptable browser as a service with an HTTP API. (Python)
  • streamparse: streamparse runs Python code against real-time streams of data. It allows users to spin up small clusters of stream processing machines locally during development. It also allows remote management of stream processing clusters that are running Apache Storm. It includes a Python module implementing the Storm multi-lang protocol; a command-line tool for managing local development, projects, and clusters; and an API for writing data processing topologies easily. (Python, Clojure)
  • TellFinder: TellFinder provides efficient visual analytics to automatically characterize and organize publicly available Internet data. Compared to standard web search engines, TellFinder enables users to research case-related data in significantly less time. Reviewing TellFinder’s automatically characterized groups also allows users to understand temporal patterns, relationships and aggregate behavior. The techniques are applicable to various domains. (JavaScript, Java)
  • Text.jl: Text.jl provides numerous tools for text processing optimized for the Julia language. Supported functionality includes algorithms for feature extraction, text classification, and language identification. (Julia)
  • TJBatchExtractor: Regex-based information extractor for online advertisements. (Java)
  • Topic: This tool takes a set of text documents, filters by a given language, and then produces documents clustered by topic. The method used is Probabilistic Latent Semantic Analysis (PLSA). (Python)
  • Topic Space: Tool for visualization for topics in document collections. (Python)
  • Tor: The core software for using and participating in the Tor network. (C)
  • The Tor Path Simulator (TorPS): TorPS quickly simulates path selection in the Tor traffic-secure communications network. It is useful for experimental analysis of alternative route selection algorithms or changes to route selection parameters. (C++, Python, Bash)
  • TREC-DD Annotation: This Annotation Tool supports the annotation task in creating ground truth data for TREC Dynamic Domain Track. It adopts drag and drop approach for assessor to annotate passage-level relevance judgement. It also supports multiple ways of browsing and search in various domains of corpora used in TREC DD. (Python, JavaScript, HTML, CSS)
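Since pykafka showed up in the list above, here is the minimal usage sketch promised there. It is only a sketch: the broker address and topic name are invented, and you should check the pykafka documentation for the current API.

```python
# Minimal pykafka sketch: publish one message and read it back.
# The broker address and topic name below are placeholders.
from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")
topic = client.topics[b"memex.test"]

# A sync producer blocks until the broker confirms delivery.
with topic.get_sync_producer() as producer:
    producer.produce(b"hello from the open catalog")

# A simple consumer reads messages; the timeout stops iteration
# once the topic has been drained.
consumer = topic.get_simple_consumer(consumer_timeout_ms=1000)
for message in consumer:
    if message is not None:
        print(message.offset, message.value)
```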

Beyond whatever use you find for the software, the catalog is also important for what it reveals about the capabilities of interest to DARPA and, by extension, to those interested in militarized IT.

April 5, 2015

Building a complete Tweet index

Filed under: Indexing,Searching,Twitter — Patrick Durusau @ 10:46 am

Building a complete Tweet index by Yi Zhuang.

Since it is Easter Sunday in many religious traditions, what could be more inspirational than “…a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms”?

From the post:

Today [11/8/2014], we are pleased to announce that Twitter now indexes every public Tweet since 2006.

Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.

This new infrastructure enables many use cases, providing comprehensive results for entire TV and sports seasons, conferences (#TEDGlobal), industry discussions (#MobilePayments), places, businesses and long-lived hashtag conversations across topics, such as #JapanEarthquake, #Election2012, #ScotlandDecides, #HongKong, #Ferguson and many more. This change will be rolling out to users over the next few days.

In this post, we describe how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.

The most important factors in our design were:

  • Modularity: Twitter already had a real-time index (an inverted index containing about a week’s worth of recent Tweets). We shared source code and tests between the two indices where possible, which created a cleaner system in less time.
  • Scalability: The full index is more than 100 times larger than our real-time index and grows by several billion Tweets a week. Our fixed-size real-time index clusters are non-trivial to expand; adding capacity requires re-partitioning and significant operational overhead. We needed a system that expands in place gracefully.
  • Cost effectiveness: Our real-time index is fully stored in RAM for low latency and fast updates. However, using the same RAM technology for the full index would have been prohibitively expensive.
  • Simple interface: Partitioning is unavoidable at this scale. But we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.
  • Incremental development: The goal of “indexing every Tweet” was not achieved in one quarter. The full index builds on previous foundational projects. In 2012, we built a small historical index of approximately two billion top Tweets, developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.
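The “simple interface” point in that list is worth dwelling on. A single endpoint that fans a query out to hidden partitions and merges the scored results might look something like the sketch below. This is purely illustrative, not Twitter’s code; the class and field names are invented.

```python
# Illustrative only: a single search() endpoint that hides the
# partition fan-out and merge from its callers. Not Twitter's code.
from concurrent.futures import ThreadPoolExecutor

class StubPartition:
    """Stand-in for one index partition with canned, scored hits."""
    def __init__(self, hits):
        self.hits = hits

    def search(self, query):
        return [h for h in self.hits if query in h["text"]]

class PartitionedIndex:
    """Single endpoint; clients never see the partitions."""
    def __init__(self, partitions):
        self.partitions = partitions

    def search(self, query, limit=10):
        # Fan the query out to every partition in parallel...
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = pool.map(lambda p: p.search(query), self.partitions)
        # ...then merge and re-rank before returning to the caller.
        merged = [hit for partial in partials for hit in partial]
        merged.sort(key=lambda h: h["score"], reverse=True)
        return merged[:limit]

index = PartitionedIndex([
    StubPartition([{"text": "tweet about #Ferguson", "score": 0.9}]),
    StubPartition([{"text": "tweet about #Election2012", "score": 0.7}]),
])
print(index.search("tweet"))
```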

If you are interested in scaling search issues, this is a must read post!

Kudos to Twitter Engineering!

PS: Of course all we need now is a complete index to Hillary Clinton’s emails. The NSA probably has a copy.

You know, the NSA could keep the same name, National Security Agency, and take over providing backups and verification for all email and web traffic, including the cloud. Would have to work on who could request copies but that would resolve the issue of backups of the Internet rather neatly. No more deleted emails, tweets, etc.

That would be a useful function, as opposed to harvesting phone data on the premise that at some point in the future it might prove to be useful, despite having not proved useful in the past.

April 1, 2015

Full-Text Search in Javascript (Part 1: Relevance Scoring)

Filed under: Javascript,Lucene,Search Engines,Searching — Patrick Durusau @ 7:47 pm

Full-Text Search in Javascript (Part 1: Relevance Scoring) by Burak Kanber.

From the post:

Full-text search, unlike most of the topics in this machine learning series, is a problem that most web developers have encountered at some point in their daily work. A client asks you to put a search field somewhere, and you write some SQL along the lines of WHERE title LIKE %:query%. It’s convincing at first, but then a few days later the client calls you and claims that “search is broken!”

Of course, your search isn’t broken, it’s just not doing what the client wants. Regular web users don’t really understand the concept of exact matches, so your search quality ends up being poor. You decide you need to use full-text search. With some MySQL fidgeting you’re able to set up a FULLTEXT index and use a more evolved syntax, the “MATCH() … AGAINST()” query.

Great! Problem solved. For smallish databases.

As you hit the hundreds of thousands of records, you notice that your database is sluggish. MySQL just isn’t great at full-text search. So you grab ElasticSearch, refactor your code a bit, and deploy a Lucene-driven full-text search cluster that works wonders. It’s fast and the quality of results is great.

Which leads you to ask: what the heck is Lucene doing so right?

This article (on TF-IDF, Okapi BM-25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.

Illustration of search engine concepts in Javascript with code for download. You can tinker to your heart’s delight.
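If you want a taste of the scoring half before reading the article, the core of TF-IDF fits in a few lines. The article works in Javascript; this Python sketch of the same idea uses a toy corpus and +1 smoothing in the IDF denominator.

```python
# Toy TF-IDF scorer: the core idea behind relevance scoring.
# Corpus and query are invented; IDF uses +1 smoothing.
import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "search engines rank documents",
]
tokenized = [d.split() for d in docs]

def idf(term):
    # Rare terms are worth more than common ones.
    containing = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / (1 + containing))

def tf_idf_score(query, doc):
    counts = Counter(doc)
    return sum((counts[t] / len(doc)) * idf(t) for t in query.split())

for doc in tokenized:
    print(" ".join(doc), "->", round(tf_idf_score("brown fox", doc), 3))
```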

Enjoy!

PS: Part 2 is promised in the next “several” weeks. Will be watching for it.

March 18, 2015

Interactive Intent Modeling: Information Discovery Beyond Search

Interactive Intent Modeling: Information Discovery Beyond Search by Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, Samuel Kaski.

From the post:

Combining intent modeling and visual user interfaces can help users discover novel information and dramatically improve their information-exploration performance.

Current-generation search engines serve billions of requests each day, returning responses to search queries in fractions of a second. They are great tools for checking facts and looking up information for which users can easily create queries (such as “Find the closest restaurants” or “Find reviews of a book”). What search engines are not good at is supporting complex information-exploration and discovery tasks that go beyond simple keyword queries. In information exploration and discovery, often called “exploratory search,” users may have difficulty expressing their information needs, and new search intents may emerge and be discovered only as they learn by reflecting on the acquired information [8,9,18]. This finding roots back to the “vocabulary mismatch problem” [13] that was identified in the 1980s but has remained difficult to tackle in operational information retrieval (IR) systems (see the sidebar “Background”). In essence, the problem refers to human communication behavior in which the humans writing the documents to be retrieved and the humans searching for them are likely to use very different vocabularies to encode and decode their intended meaning [8,21].

Assisting users in the search process is increasingly important, as everyday search behavior ranges from simple look-ups to a spectrum of search tasks 23 in which search behavior is more exploratory and information needs and search intents uncertain and evolving over time.

We introduce interactive intent modeling, an approach promoting resourceful interaction between humans and IR systems to enable information discovery that goes beyond search. It addresses the vocabulary mismatch problem by giving users potential intents to explore, visualizing them as directions in the information space around the user’s present position, and allowing interaction to improve estimates of the user’s search intents.

What!? All those years spent trying to beat users into learning complex search languages were in vain? Say it’s not so!

But, apparently it is so. All of the research on the “vocabulary mismatch problem,” on “different vocabularies to encode and decode their meaning,” has come back to bite information systems that offer static, author-driven vocabularies.

Users search best, no surprise, through vocabularies they recognize and understand.

I don’t know of any interactive topic maps in the sense used here but that doesn’t mean that someone isn’t working on one.

A shift in this direction could do wonders for the results of searches.

March 17, 2015

On Lemmings and PageRank

Filed under: PageRank,Searching,Software — Patrick Durusau @ 4:04 pm

Solving Open Source Discovery by Andrew Nesbitt.

From the post:

Today I’m launching Libraries.io, a project that I’ve been working on for the past couple of months.

The intention is to help developers find new open source libraries, modules and frameworks and keep track of ones they depend upon.

The world of open source software depends on a lot of open source libraries. We are standing on the shoulders of giants, which helps us to reach further than we could otherwise.

The problem with platforms like Rubygems and NPM is there are so many libraries, with hundreds of new ones added every day. Trying to find the right library can be overwhelming.

How do you find libraries that help you solve problems? How do you then know which of those libraries are worth using?

Andrew substitutes dependencies for links in a PageRank algorithm and then:

Within Libraries.io I’ve aggregated over 700,000 projects, written in 130 languages from across 22 package managers, including dependencies, releases, license information and source code repository information. This results in a rich index of almost every open source library available for use today.

Follow me on Twitter at @teabass and @librariesio for updates. Discussion on Hacker News: https://news.ycombinator.com/item?id=9211084.

Is Libraries.io going to be useful? Yes!

Is Libraries.io a fun way to explore projects? Yes!

Is Libraries.io a great alternative to current source search options? Yes!

Is Libraries.io the solution to open source discovery? Less clear.

I say that because PageRank, whether using hyperlinks or dependencies, results in a lemming view of the world in question.

Wikipedia reports this is an image of a lemming:

[image: a lemming]

I, on the other hand, bear a passing resemblance to this image:

[image: Patrick’s photo]

I offer those images as evidence that I am not a lemming! 😉

The opinions and usages of others can be of interest, but I follow work and people of interest to me, not because they are of interest to others. Otherwise I would be following Lady Gaga on Twitter, for example. To save you the trouble of downloading her forty-five million (45M) followers, I hereby attest that I am not one of them.

Make no mistake, Andrew’s work should be used, followed, supported, improved, but as another view of an important data set, not a solution.
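If you want to experiment with that “another view” yourself, “PageRank over dependencies” reduces to power iteration over a graph where an edge A -> B means “A depends on B.” A minimal sketch, with an invented graph (Libraries.io’s actual algorithm may differ):

```python
# PageRank by power iteration over an invented dependency graph.
# An edge A -> B means "A depends on B", so heavily-depended-upon
# libraries accumulate rank. Not Libraries.io's actual code.
deps = {
    "app": ["requests", "flask"],
    "flask": ["werkzeug", "jinja2"],
    "requests": ["urllib3"],
    "werkzeug": [], "jinja2": [], "urllib3": [],
}

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node with no dependencies spreads its rank evenly.
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

for name, score in sorted(pagerank(deps).items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {score:.3f}")
```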

I first saw this in a tweet by Arfon Smith.

Internet Search as a Crap Shoot in a Black Box

Filed under: Search Algorithms,Search Interface,Searching — Patrick Durusau @ 2:57 pm

The post, Google To Launch New Doorway Page Penalty Algorithm by Barry Schwartz reminded me that Internet search is truly a crap shoot in a black box.

Google has over two hundred (200) factors that are known (or suspected) to play a role in its search algorithms and their ranking of results.

Even if you memorized the 200, if you are searching you don’t know how those factors will impact pages with information you want to see. (Unless you want to drive web traffic, the 200 factors are a curiosity and not much more.)

When you throw your search terms, like dice, into the Google black box, you don’t know how they will interact with the unknown results of the ranking algorithms.

To make matters worse, yes, worse, the Google algorithms change over time. Some major, some not quite so major. But every change stands a chance to impact any ad hoc process you have adopted for finding information.

A good number of you won’t remember print indexes but one of their attractive features (in hindsight) was that the indexing was uniform, at least within reasonable limits, for decades. If you learned how to effectively use the printed index, you could always find information using that technique, without fear that the familiar results would simply disappear.

Perhaps that is a commercial use case for the Common Crawl data. Imagine a disclosed ranking algorithm that could be applied to create a custom ranking for a subset of the data against which to perform searches. So the ranking against which you are searching is known and can be explored.

It would not have the very latest data but that’s difficult to extract from Google since it apparently tosses the information about when it first encountered a page. Or at the very least doesn’t make it available to users. At least as an option, being able to pick the most recent resources matching a search would be vastly superior to the page-rank orthodoxy at Google.
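To make the disclosed-ranking idea concrete: the whole point is that the scoring function and its weights are published, so a searcher can see exactly why one page outranks another. A toy sketch, with invented weights and fields:

```python
# Toy "disclosed ranking": every weight is visible and adjustable.
# Fields and weights are invented for illustration.
from datetime import datetime, timezone

def disclosed_score(page, query_terms, w_match=1.0, w_recency=0.5):
    text = page["text"].lower()
    match = sum(text.count(t.lower()) for t in query_terms)
    age_days = (datetime.now(timezone.utc) - page["first_seen"]).days
    recency = 1.0 / (1 + age_days)  # newer pages score higher
    return w_match * match + w_recency * recency

pages = [
    {"text": "Common Crawl data release",
     "first_seen": datetime(2015, 3, 1, tzinfo=timezone.utc)},
    {"text": "Common Crawl Common Crawl archive",
     "first_seen": datetime(2013, 6, 1, tzinfo=timezone.utc)},
]
query = ["common", "crawl"]
for p in sorted(pages, key=lambda p: -disclosed_score(p, query)):
    print(p["text"])
```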

Not to single Google out too much because I haven’t encountered other search engines that are more transparent. They may exist but I am unaware of them.

March 16, 2015

ADS: The Next Generation Search Platform

Filed under: Astroinformatics,Bibliography,Searching — Patrick Durusau @ 6:28 pm

ADS: The Next Generation Search Platform by Alberto Accomazzi et al.

Abstract:

Four years after the last LISA meeting, the NASA Astrophysics Data System (ADS) finds itself in the middle of major changes to the infrastructure and contents of its database. In this paper we highlight a number of features of great importance to librarians and discuss the additional functionality that we are currently developing. Starting in 2011, the ADS started to systematically collect, parse and index full-text documents for all the major publications in Physics and Astronomy as well as many smaller Astronomy journals and arXiv e-prints, for a total of over 3.5 million papers. Our citation coverage has doubled since 2010 and now consists of over 70 million citations. We are normalizing the affiliation information in our records and, in collaboration with the CfA library and NASA, we have started collecting and linking funding sources with papers in our system. At the same time, we are undergoing major technology changes in the ADS platform which affect all aspects of the system and its operations. We have rolled out and are now enhancing a new high-performance search engine capable of performing full-text as well as metadata searches using an intuitive query language which supports fielded, unfielded and functional searches. We are currently able to index acknowledgments, affiliations, citations, funding sources, and to the extent that these metadata are available to us they are now searchable under our new platform. The ADS private library system is being enhanced to support reading groups, collaborative editing of lists of papers, tagging, and a variety of privacy settings when managing one’s paper collection. While this effort is still ongoing, some of its benefits are already available through the ADS Labs user interface and API at this http URL

Now for a word from the people who were using “big data” before it was a buzz word!

The focus here is on smaller data, publications, but it makes a good read.

I have been following the work on Solr proper and am interested in learning more about the extensions created to Solr by ADS.

Enjoy!

I first saw this in a tweet by Kirk Borne.

January 4, 2015

Find In-Depth Articles (Google Hack)

Filed under: Search Engines,Searching — Patrick Durusau @ 7:56 pm

Find In-Depth Articles by Alex Chitu.

From the post:

Sometimes you want to find more about a topic and you find a lot of superficial news articles and blog posts that keep rehashing the same information. Google shows a list of in-depth articles for some queries, but this feature seems to be restricted to the US and it’s only displayed for some queries.

See Alex’s post for the search string addition.

Works with the examples but be forewarned, it doesn’t work with every search.

I tried “deep+learning” and got the usual results.

If you are researching topics where Google has in-depth articles, this could be quite useful.

Just glancing at some of the other posts, this looks like a blog to follow if you do any amount of searching.

Enjoy!

I first saw this in a tweet by Aaron Kirschenfeld.

December 20, 2014

Monte-Carlo Tree Search for Multi-Player Games [Semantics as Multi-Player Game]

Filed under: Games,Monte Carlo,Search Trees,Searching,Semantics — Patrick Durusau @ 2:25 pm

Monte-Carlo Tree Search for Multi-Player Games by Joseph Antonius Maria Nijssen.

From the introduction:

The topic of this thesis lies in the area of adversarial search in multi-player zero-sum domains, i.e., search in domains having players with conflicting goals. In order to focus on the issues of searching in this type of domains, we shift our attention to abstract games. These games provide a good test domain for Artificial Intelligence (AI). They offer a pure abstract competition (i.e., comparison), with an exact closed domain (i.e., well-defined rules). The games under investigation have the following two properties. (1) They are too complex to be solved with current means, and (2) the games have characteristics that can be formalized in computer programs. AI research has been quite successful in the field of two-player zero-sum games, such as chess, checkers, and Go. This has been achieved by developing two-player search techniques. However, many games do not belong to the area where these search techniques are unconditionally applicable. Multi-player games are an example of such domains. This thesis focuses on two different categories of multi-player games: (1) deterministic multi-player games with perfect information and (2) multi-player hide-and-seek games. In particular, it investigates how Monte-Carlo Tree Search can be improved for games in these two categories. This technique has achieved impressive results in computer Go, but has also shown to be beneficial in a range of other domains.

This chapter is structured as follows. First, an introduction to games and the role they play in the field of AI is provided in Section 1.1. An overview of different game properties is given in Section 1.2. Next, Section 1.3 defines the notion of multi-player games and discusses the two different categories of multi-player games that are investigated in this thesis. A brief introduction to search techniques for two-player and multi-player games is provided in Section 1.4. Subsequently, Section 1.5 defines the problem statement and four research questions. Finally, an overview of this thesis is provided in Section 1.6.

This thesis is great background reading on the use of Monte-Carlo tree search in games. While reading the first chapter, I realized that assigning semantics to a token is an instance of a multi-player game with hidden information. That is, the “semantic” of any token doesn’t exist in some Platonic universe but rather is the result of some N number of players who also accept a particular semantic for some given token in a particular context. And we lack knowledge of the semantic and the reasons for it that will be assigned by some N number of players, which may change over time and context.

The semiotic triangle of Ogden and Richards (The Meaning of Meaning):

[image: Ogden and Richards’ semiotic triangle]

for any given symbol, represents the view of a single speaker. But as Ogden and Richards note, what is heard by listeners should be represented by multiple semiotic triangles:

Normally, whenever we hear anything said we spring spontaneously to an immediate conclusion, namely, that the speaker is referring to what we should be referring to were we speaking the words ourselves. In some cases this interpretation may be correct; this will prove to be what he has referred to. But in most discussions which attempt greater subtleties than could be handled in a gesture language this will not be so. (The Meaning of Meaning, page 15 of the 1923 edition)

Is RDF/OWL more subtle than can be handled by a gesture language? If you think so then you have discovered one of the central problems with the Semantic Web and any other universal semantic proposal.

Not that topic maps escape a similar accusation, but with topic maps you can encode additional semiotic triangles in an effort to avoid confusion, at least to the extent of funding and interest. And if you aren’t trying to avoid confusion, you can supply semiotic triangles that reach across understandings to convey additional information.

You can’t avoid confusion altogether nor can you achieve perfect communication with all listeners. But, for some defined set of confusions or listeners, you can do more than simply repeat your original statements in a louder voice.

Whether Monte-Carlo Tree searches will help deal with the multi-player nature of semantics isn’t clear but it is an alternative to repeating “…if everyone would use the same (my) system, the world would be better off…” ad nauseam.

I first saw this in a tweet by Ebenezer Fogus.

December 2, 2014

Nonsensical ‘Unbiased Search’ Proposal

Filed under: EU,Governance,Search Engines,Searching — Patrick Durusau @ 4:50 pm

Forget EU’s Toothless Vote To ‘Break Up’ Google; Be Worried About Nonsensical ‘Unbiased Search’ Proposal by Mike Masnick.

Mike uncovers (in plain sight) the real danger of the recent EU proposal to “break up” Google.

Reading the legislation (which I neglected to do), Mike writes:

But within the proposal, a few lines down, there was something that might be even more concerning, and more ridiculous, even if it generated fewer (actually, almost no) headlines. And it’s that, beyond “breaking up” search engines, the resolution also included this bit of nonsense, saying that search engines need to be “unbiased”:

Stresses that, when operating search engines for users, the search process and results should be unbiased in order to keep internet searches non-discriminatory, to ensure more competition and choice for users and consumers and to maintain the diversity of sources of information; notes, therefore, that indexation, evaluation, presentation and ranking by search engines must be unbiased and transparent; calls on the Commission to prevent any abuse in the marketing of interlinked services by search engine operators;

But what does that even mean? Search is inherently biased. That’s the point of search. You want the best results for what you’re searching for, and the job of the search engine is to rank results by what it thinks is the best. An “unbiased” search engine isn’t a search engine at all. It just returns stuff randomly.

See Mike’s post for additional analysis of this particular mummers’ farce.

Another example why the Internet should be governed by a new structure, staffed by people with the technical knowledge to make sensible decisions. By “new structure” I mean one separate from and not subject to any existing government. Including the United States, where the head of the NSA thinks local water supplies are controlled over the Internet (FALSE).

I first saw this in a tweet by Joseph Esposito.

November 25, 2014

Treasury Island: the film

Filed under: Archives,Government,Government Data,History,Indexing,Library,Searching — Patrick Durusau @ 5:52 pm

Treasury Island: the film by Lauren Willmott, Boyce Keay, and Beth Morrison.

From the post:

We are always looking to make the records we hold as accessible as possible, particularly those which you cannot search for by keyword in our catalogue, Discovery. And we are experimenting with new ways to do it.

The Treasury series, T1, is a great example of a series which holds a rich source of information but is complicated to search. T1 covers a wealth of subjects (from epidemics to horses) but people may overlook it as most of it is only described in Discovery as a range of numbers, meaning it can be difficult to search if you don’t know how to look. There are different processes for different periods dating back to 1557 so we chose to focus on records after 1852. Accessing these records requires various finding aids and multiple stages to access the papers. It’s a tricky process to explain in words so we thought we’d try demonstrating it.

We wanted to show people how to access these hidden treasures, by providing a visual aid that would work in conjunction with our written research guide. Armed with a tablet and a script, we got to work creating a video.

Our remit was:

  • to produce a video guide no more than four minutes long
  • to improve accessibility to these records through a simple, step-by-step process
  • to highlight what the finding aids and documents actually look like

These records can be useful to a whole range of researchers, from local historians to military historians to social historians, given that virtually every area of government action involved the Treasury at some stage. We hope this new video, which we intend to be watched in conjunction with the written research guide, will also be of use to any researchers who are new to the Treasury records.

Adding video guides to our written research guides is a new venture for us and so we are very keen to hear your feedback. Did you find it useful? Do you like the film format? Do you have any suggestions or improvements? Let us know by leaving a comment below!

This is a great illustration that data management isn’t something new. The Treasury Board has kept records since 1557 and has accumulated a rather extensive set of materials.

The written research guide looks interesting but since I am very unlikely to ever research Treasury Board records, I am unlikely to need it.

However, the authors have anticipated that someone might be interested in the process of record keeping itself and so provided this additional reference:

Thomas L Heath, The Treasury (The Whitehall Series, 1927, GP Putnam’s Sons Ltd, London and New York)

That would be an interesting find!

I first saw this in a tweet by Andrew Janes.

November 18, 2014

Twitter Now Lets You Search For Any Tweet Ever Sent

Filed under: Searching,Tweets — Patrick Durusau @ 1:46 pm

Twitter Now Lets You Search For Any Tweet Ever Sent by Cade Metz.

From the post:


This morning, Twitter began rolling out a search service that lets you search for any tweet in its archive.

Though the new Twitter search engine is limited to rather rudimentary keyword searches today, the company plans to expand into more complex queries in the months and years to come. And the foundational search infrastructure laid down by the company will help drive other Twitter tools as well. “It lets us power a lot more things down the road—not just search,” says Gilad Mishne, the Twitter engineering director who helped oversee the project.

Well, that’s both good news and better news!

Good news because of being able to search and link to the full corpus of tweets.

Better news because of the search market gap that Cade reports, which is quite similar to Google’s.

You can search for anything you want, but the results, semantically speaking, are going to be a crap shoot.

Do users really have time for hit-or-miss search results? Some do, some don’t.

If yours don’t, let’s talk.

November 5, 2014

Google and Mission Statements

Filed under: Google+,Indexing,Searching — Patrick Durusau @ 4:55 pm

Google has ‘outgrown’ its 14-year-old mission statement, says Larry Page by Samuel Gibbs.

From the post:

Google’s chief executive Larry Page has admitted that the company has outgrown its mission statement to “organise the world’s information and make it universally accessible and useful” from the launch of the company in 1998, but has said he doesn’t yet know how to redefine it.

Page insists that the company is still focused on the altruistic principles that it was founded on in 1998 with the original mission statement, when he and co-founder Sergey Brin were aiming big with “societal goals” to “organise the world’s information and make it universally accessible and useful”.

Questioned as to whether Google needs to alter its mission statement, which was twinned with the company mantra “don’t be evil,” for the next stage of company growth in an interview with the Financial Times, Page responded: “We’re in a bit of uncharted territory. We’re trying to figure it out. How do we use all these resources … and have a much more positive impact on the world?”

This post came as a surprise to me because I was unaware that Google had solved the problem of “organis[ing] the world’s information and mak[ing] it universally accessible and useful.”

Perhaps so but it hasn’t made it to the server farm that sends results to me.

A quick search using Google on “cia” today produces a front page with resources on the Central Intelligence Agency, the Culinary Institute of America, Certified Internal Auditor (CIA) certification and, allegedly, 224,000,000 more results.

If I search using “Central Intelligence Agency,” I get a “purer” stream of content on the Central Intelligence Agency, that runs from its official website, https://www.cia.gov, to the Wikipedia article, http://en.wikipedia.org/wiki/Central_Intelligence_Agency, and ArtsBeat | Can’t Afford a Giacometti Sculpture? There’s Always the CIA’s bin Laden Action Figure.

Even with a detailed query Google search results remind me of a line from Saigon Warrior that goes:

But the organization is a god damned disgrace

https://www.youtube.com/watch?v=0-U9Ns9oG6E

If Larry Page thinks Google has “organise[d] the world’s information and ma[de] it universally accessible and useful,” he needs a reality check.

True, Google has gone further than any other enterprise towards indexing some of the world’s information, but hardly all of it, nor is it usefully organized.

Why expand Google’s corporate mission when the easy part of the earlier mission has been accomplished and the hard part is about to start?

Perhaps some enterprising journalist will ask Page why Google is dodging the hard part of organizing information? Yes?

October 13, 2014

Measuring Search Relevance

Filed under: Relevance,Searching — Patrick Durusau @ 10:33 am

Measuring Search Relevance by Hugh E. Williams.

From the post:

The process of asking many judges to assess search performance is known as relevance judgment: collecting human judgments on the relevance of search results. The basic task goes like this: you present a judge with a search result, and a search engine query, and you ask the judge to assess how relevant the item is to the query on (say) a four-point scale.

Suppose the query you want to assess is ipod nano 16Gb. Imagine that one of the results is a link to Apple’s page that describes the latest Apple iPod nano 16Gb. A judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale). They’d then click on a radio button to record their vote and move on to the next task. If the result we showed them was a story about a giraffe, the judge might decide this result is “irrelevant” (say the lowest rating on the four-point scale). If it were information about an iPhone, it might be “partially relevant” (say the second-to-lowest), and if it were a review of the latest iPod nano, the judge might say “relevant” (it’s not perfect, but it sure is useful information about an Apple iPod).

The human judgment process itself is subjective, and different people will make different choices. You could argue that a review of the latest iPod nano is a “great result” — maybe you think it’s even better than Apple’s page on the topic. You could also argue that the definitive Apple page isn’t terribly useful in making a buying decision, and you might only rate it as relevant. A judge who knows everything about Apple’s products might make a different decision to someone who’s never owned a digital music player. You get the idea. In practice, judging decisions depend on training, experience, context, knowledge, and quality — it’s an art at best.

There are a few different ways to address subjectivity and get meaningful results. First, you can ask multiple judges to assess the same results to get an average score. Second, you can judge thousands of queries, so that you can compute metrics and be confident statistically that the numbers you see represent true differences in performance between algorithms. Last, you can train your judges carefully, and give them information about what you think relevance means.

An illustrated walk through measuring search relevance. Useful for a basic understanding of the measurement process and its parameters.
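The first remedy Hugh mentions, averaging multiple judges, is simple enough to sketch. The four-point scale (0 = irrelevant through 3 = great) and the data below are invented:

```python
# Average several judges' ratings per (query, result) pair to damp
# individual subjectivity. Scale and data are invented.
from collections import defaultdict
from statistics import mean

judgments = [  # (query, result, rating on 0..3 scale)
    ("ipod nano 16Gb", "apple.com/ipod-nano", 3),
    ("ipod nano 16Gb", "apple.com/ipod-nano", 2),
    ("ipod nano 16Gb", "giraffe-story.example", 0),
    ("ipod nano 16Gb", "giraffe-story.example", 1),
]

by_pair = defaultdict(list)
for query, result, rating in judgments:
    by_pair[(query, result)].append(rating)

for (query, result), ratings in by_pair.items():
    print(f"{result:25s} mean={mean(ratings):.1f} n={len(ratings)}")
```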

Bookmark this post so that when you tell your judges what “…relevance means,” you can return here and post what you told them.

I ask because I deeply suspect that our ideas of “relevance” vary widely from subject to subject.

Thanks!

September 12, 2014

Connected Histories: British History Sources, 1500-1900

Filed under: History,Search Engines,Searching — Patrick Durusau @ 4:24 pm

Connected Histories: British History Sources, 1500-1900

From the webpage:

Connected Histories brings together a range of digital resources related to early modern and nineteenth century Britain with a single federated search that allows sophisticated searching of names, places and dates, as well as the ability to save, connect and share resources within a personal workspace. We have produced this short video guide to introduce you to the key features.

Twenty-two remarkable resources can be searched by place, person, or keyword. Some of the sources require subscriptions but the vast majority do not. A summary of the resources would fail to do them justice; the full list of currently searchable resources is on the site.

As you probably assume, there is no binding point for any person, object, date or thing across all twenty-two resources with its associations to other persons, objects, dates or things.

As you explore Connected Histories, keep track of where you found information on a person, object, date or thing. Depending on the granularity of pointing, you might want to create a topic map to capture that information.

September 8, 2014

Demystifying The Google Knowledge Graph

Filed under: Entities,Google Knowledge Graph,Search Engines,Searching — Patrick Durusau @ 3:28 pm

Demystifying The Google Knowledge Graph by Barbara Starr.

[image: knowledge graph]

Barbara covers:

  • Explicit vs. Implicit Entities (and how to determine which is which on your webpages)
  • How to improve your chances of being in “the Knowledge Graph” using Schema.org and JSON-LD. (A markup sketch follows this list.)
  • Thinking about “things, not strings.”
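On the Schema.org/JSON-LD point, the markup reduces to a typed dictionary embedded in the page. A hypothetical event, built in Python for concreteness (all values invented):

```python
# Hypothetical Schema.org "MusicEvent" as JSON-LD, built as a dict
# and serialized for embedding in a <script type="application/ld+json">
# tag. All values are invented.
import json

event = {
    "@context": "http://schema.org",
    "@type": "MusicEvent",
    "name": "Example Quartet Live",
    "startDate": "2014-10-01T20:00",
    "location": {
        "@type": "Place",
        "name": "Example Hall",
        "address": "123 Main St, Anytown",
    },
    "offers": {"@type": "Offer", "url": "http://tickets.example.com/123"},
}

print(json.dumps(event, indent=2))
```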

Is there something special about “events?” I remember the early Semantic Web motivations included setting up tennis matches between colleagues. The examples here are of sporting and music events.

If your users don’t know how to use TicketMaster, repeating delivery of that data on your site isn’t going to help them.

On the other hand, this is a good reminder to extract from Schema.org all the “types” that would be useful for my blog.

PS: A “string” doesn’t become a “thing” simply because it has a longer token. Having an agreed upon “longer token” from a vocabulary such as Schema.org does provide more precise identification than an unadorned “string.”

Having said that, the power of having several key/value pairs and a declaration of which ones must, may or must not match, should be readily obvious. Particularly when those keys and values may themselves be collections of key/value pairs.
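A sketch of that matching idea, with invented rule names and data: keys marked “must” have to be present and equal in both records, a shared “must not” key with equal values blocks the match, and “may” keys only add confidence.

```python
# Sketch of key/value matching with must / may / must-not rules.
# Rule names and sample records are invented.
def match(a, b, must=(), may=(), must_not=()):
    if any(k not in a or k not in b or a[k] != b[k] for k in must):
        return False, 0.0
    if any(k in a and k in b and a[k] == b[k] for k in must_not):
        return False, 0.0
    # "may" keys don't decide the match, only the confidence in it.
    evidence = sum(1 for k in may if k in a and k in b and a[k] == b[k])
    confidence = 1.0 if not may else 0.5 + 0.5 * evidence / len(may)
    return True, confidence

a = {"schema:name": "CIA", "schema:url": "https://www.cia.gov"}
b = {"schema:name": "CIA", "schema:url": "https://www.cia.gov",
     "schema:foundingDate": "1947"}
print(match(a, b, must=("schema:name", "schema:url"),
            may=("schema:foundingDate",)))
```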

August 25, 2014

An Introduction to Congress.gov

Filed under: Law,Searching — Patrick Durusau @ 4:50 pm

An Introduction to Congress.gov by Robert Brammer.

From the post:

Barbara Bavis, Ashley Sundin, and I are happy to bring you an introduction to Congress.gov. This video provides a brief explanation of how to use the new features in the latest release, such as accounts, saved searches, member remarks in the Congressional Record, and executive nominations. If you would like more in-depth training on Congress.gov, we hold bi-monthly webinars that are free and available to the public. Our next webinar is scheduled from 2-3 p.m. on September 25, 2014, and you can sign up for it on Law.gov. Do you have an opinion on Congress.gov that you would like to share with us, such as new features that you would like to see added to the site? Please let us know by completing the following survey. Also, if there is something you would like us to cover in a future video, please leave us a comment below.

There are mid-term elections this year (2014) and information on current members of Congress will be widely sought.

The video is only twenty (20) minutes but will help you quickly search a variety of information concerning Congress.

Take special note that once you discover information, the system does not bundle it together for the next searcher.

July 7, 2014

Search Suggestions

Filed under: Humor,Searching — Patrick Durusau @ 6:38 pm

James Hughes posted this image of search suggestions to Twitter:

[image: search suggestions]

How do you check search suggestions?

June 9, 2014

RegexTip

Filed under: Regex,Regexes,Searching — Patrick Durusau @ 2:00 pm

RegexTip

RegexTip is a Twitter account maintained by John D. Cook and it sends out one (1) regex tip per week.

Regexes or regular expressions are everywhere in computer science but especially in search.

I just saw a tweet by Scientific Python that the cycle of regex tips has restarted with the basics.

Good time to follow RegexTip.

May 21, 2014

Your own search engine…

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:46 pm

Your own search engine (based on Apache Solr open-source enterprise-search)

From the webpage:

Tools for easier searching with free software on your own server

  • search in many documents, images and files
    • full text search with powerful search operators
    • in many different formats (text, word, openoffice, PDF, sheets, csv, doc, images, jpg, video and many more)
    • get an overview by explorative search and comfortable, powerful navigation with faceted search (easy-to-use interactive filters; see the query sketch after this list)
  • analyze documents (preview, extracted text, wordlists and visualizations with wordclouds and trend charts)
  • structure your research, investigation, navigation, metadata or notes (semantic wiki for tagging documents, annotations and structured notes)
  • OCR: automatic text recognition for images and graphical content or scans inside PDF, i.e. for scanned or photographed documents
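Once such a Solr index is populated, the query sketch promised above might look like this from Python, using the pysolr client. The URL, core name, and field names are placeholders for your own setup.

```python
# Querying a local Solr core, with facets, via pysolr.
# URL, core name, and field names are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/documents", timeout=10)

results = solr.search(
    "contract",  # full-text query
    **{
        "facet": "true",
        "facet.field": ["author", "filetype"],  # interactive filters
        "rows": 10,
    },
)

print(f"{results.hits} documents matched")
for doc in results:
    print(doc.get("title"), doc.get("filetype"))
print(results.facets["facet_fields"])  # counts backing the facet UI
```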

Do you think this would be a way to pull back the curtain on search a bit? To show people that even results like we see from Google require more than casual effort?

I ask because Jeni Tennison tweeted earlier today:

#TDC14 @emckean “search is the hammer that makes us think everything is a nail that can be searched for”

Is a common misunderstanding of search making “improved” finding methods a difficult sell?

Not that I have a lot of faith or interest in educating potential purchasers. Finding a way to use the misunderstanding seems like a better marketing strategy to me.

Suggestions?

Practical Relevance Ranking for 11 Million Books, Part 1

Filed under: Relevance,Searching — Patrick Durusau @ 1:54 pm

Practical Relevance Ranking for 11 Million Books, Part 1 by Tom Burton-West.

From the post:

This is the first in a series of posts about our work towards practical relevance ranking for the 11 million books in the HathiTrust full-text search application.

Relevance is a complex concept which reflects aspects of a query, a document, and the user as well as contextual factors. Relevance involves many factors such as the user’s preferences, the user’s task, the user’s stage in their information-seeking, the user’s domain knowledge, the user’s intent, and the context of a particular search.

While many different kinds of relevance have been discussed in the literature, topical relevance is the one most often used in testing relevance ranking algorithms. Topical relevance is a measure of “aboutness”, and attempts to measure how much a document is about the topic of a user’s query.

At its core, relevance ranking depends on an algorithm that uses term statistics, such as the number of times a query term appears in a document, to provide a topical relevance score. Other ranking features that try to take into account more complex aspects of relevance are built on top of this basic ranking algorithm.

In many types of search, such as e-commerce or searching for news, factors other than the topical relevance (based on the words in the document) are important. For example, a search engine for e-commerce might have facets such as price, color, size, availability, and other attributes, that are of equal importance to how well the user’s query terms match the text of a document describing a product. In news retrieval, recency and the location of the user might be factored into the relevance ranking algorithm. (footnotes omitted)

Great post that discusses the impact of the length of a document on its relevancy ranking by Lucene/Solr. That impact is well known, but how to carry over relevancy studies on short documents to long documents (books) isn’t known.

I am looking forward to Part 2, which will cover the relationship between relevancy and document length.
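While waiting for Part 2, the knob at issue is easy to see in BM25’s term-frequency component: the b parameter controls how strongly long documents are penalized. A toy computation (not HathiTrust’s configuration):

```python
# BM25's term-frequency weight with length normalization.
# b (0..1) controls the length penalty; k1 controls saturation.
# Numbers below are toys, not HathiTrust's configuration.
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * norm)

avg = 2000  # pretend collection-wide average document length
for doc_len in (500, 2000, 200_000):  # article, average doc, book
    w = bm25_tf(tf=10, doc_len=doc_len, avg_doc_len=avg)
    print(f"len={doc_len:>7}: tf-weight={w:.3f}")
```

With identical term frequency, the book-length document scores far lower, which is exactly the effect the post is wrestling with.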

May 19, 2014

… Characters of Clojure

Filed under: Clojure,Searching — Patrick Durusau @ 2:33 pm

The Weird and Wonderful Characters of Clojure by James Hughes.

From the post:

A reference collection of characters used in Clojure that are difficult to “google”. Descriptions sourced from various blogs, StackOverflow, Learning Clojure and the official Clojure docs – sources attributed where necessary. Use CTRL-F “Character: …” to search or type the symbols into the box below. Sections not in any particular order but related items are grouped for ease. If I’m wrong or missing anything worthy of inclusion tweet me @kouphax or mail me at james@yobriefca.se.

Definitely a candidate for your browser toolbar!

I first saw this in a tweet by Daniel Higginbotham.

May 4, 2014

“Credibility” As “Google Killer”?

Filed under: Facebook,Relevance,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 6:26 pm

Nancy Baym tweets: “Nice article on flaws of ‘it’s not our fault, it’s the algorithm’ logic from Facebook with quotes from @TarletonG” pointing to: Facebook draws fire on ‘related articles’ push.

From the post:

A surprise awaited Facebook users who recently clicked on a link to read a story about Michelle Obama’s encounter with a 10-year-old girl whose father was jobless.

Facebook responded to the click by offering what it called “related articles.” These included one that alleged a Secret Service officer had found the president and his wife having “S*X in Oval Office,” and another that said “Barack has lost all control of Michelle” and was considering divorce.

A Facebook spokeswoman did not try to defend the content, much of which was clearly false, but instead said there was a simple explanation for why such stories are pushed on readers. In a word: algorithms.

The stories, in other words, apparently are selected by Facebook based on mathematical calculations that rely on word association and the popularity of an article. No effort is made to vet or verify the content.

Facebook’s explanation, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. (emphasis added)

Just imagine the hue and cry had that last line read:

Imaginary Quote Google’s explanation of search results, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. End Imaginary Quote

Is demanding “credibility” of search results the long sought after “Google Killer?”

“Credibility” is closely related to the “search” problem but I think it should be treated separately from search.

In part because answering a “credibility” question can require multiple searches: on the author of the search result content, for reviews and comments on that content, and of other sources covering the same ground, followed by a collation of all that additional material into a credibility judgement on the search result content. The procedure isn’t always that elaborate, but the main point is that even beginning to answer a credibility question requires additional searching and evaluation of content.

Not to mention that why the information is being sought has a bearing on credibility. If I want to find examples of nutty things said about President Obama to cite, then finding the cases mentioned above is not only relevant (the search question) but also “credible” in the sense that Facebook did not make them up. They are published nutty statements about the current President.

What if a user wanted to search for “coffee and bagels”? The top hit on one popular search engine today is Coffee Meets Bagel: Free Online Dating Sites, followed by numerous other links about that first result. Was this relevant to my search? No, but search results aren’t always predictable. They are relevant to someone’s search using “coffee and bagels.”

It is the responsibility of every reader to decide for themselves what is relevant, credible, useful, etc. in terms of content, whether it is hard copy or digital.

Any other solution takes us to Plato’s Republic, which was great to read about, but I would not want to live there.

May 3, 2014

OpenPolicy [Patent on Paragraphs?]

Filed under: Government,Patents,Searching — Patrick Durusau @ 3:29 pm

OpenPolicy: Knowledge Makes Document Searches Smarter

From the webpage:

The government has a wealth of policy knowledge derived from specialists in myriad fields. What it lacked, until now, was a flexible method for searching the content of thousands of policies using the knowledge of those experts. LMI has developed a tool—OpenPolicy™—to provide agencies with the ability to capture the knowledge of their experts and use it to intuitively search their massive storehouse of policy at hyper speeds.

Traditional search engines produce document-level results. There’s no simple way to search document contents and pinpoint appropriate paragraphs. OpenPolicy solves this problem. The search tool, running on a semantic-web database platform, LMI SME-developed ontologies, and web-based computing power, can currently host tens of thousands of pages of electronic documents. Using domain-specific vocabularies (ontologies), the tool also suggests possible search terms and phrases to help users refine their search and obtain better results.

For agencies wanting to use OpenPolicy, LMI initially builds a powerful computing environment to host the knowledgebase. It then loads all of an agency’s documents—policies, regulations, meeting notes, trouble tickets, essentially any text-based file—into the database. The system can scale to store billions of paragraphs.

No detail on the technology behind OpenPolicy but the mention of paragraphs is enough to make me wary of possible patents on paragraphs.

I am hopeful that even the USPTO would balk at patenting paragraphs in general or as the results of a search but I would not bet money on it.

If you know of any such patents, please post them in comments below.
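Patents aside, paragraph-level search itself is straightforward to prototype. Here is a minimal sketch of the general idea, entirely my own and not based on OpenPolicy’s undisclosed design, that indexes each paragraph as its own retrieval unit:

```python
from collections import defaultdict

index = defaultdict(set)   # term -> {(doc_id, paragraph_no)}
paragraphs = {}            # (doc_id, paragraph_no) -> text

def add_document(doc_id, text):
    # Treat each blank-line-separated paragraph as its own retrieval unit.
    paras = [p for p in text.split("\n\n") if p.strip()]
    for para_no, para in enumerate(paras):
        key = (doc_id, para_no)
        paragraphs[key] = para
        for term in para.lower().split():
            index[term].add(key)

def search(query):
    # Return paragraphs containing every query term (AND semantics).
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [paragraphs[k] for k in sorted(hits)]

add_document("policy-7",
             "Travel must be approved in advance.\n\n"
             "Per diem rates follow the GSA tables.")
print(search("per diem"))  # pinpoints the second paragraph only
```

Presumably OpenPolicy’s ontology-driven term suggestions sit on top of a paragraph index along these lines, expanding queries with domain vocabulary before the lookup.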

I first saw this at: LMI Named a Winner in Destination Innovation Competition by Angela Guess.

April 11, 2014

Google Top 10 Search Tips

Filed under: Search Engines,Searching — Patrick Durusau @ 6:47 pm

Google Top 10 Search Tips by Karen Blakeman.

From the post:

These are the top 10 tips from the participants of a recent workshop on Google, organised by UKeiG and held on 9th April 2014. The edited slides from the day can be found on authorSTREAM at http://www.authorstream.com/Presentation/karenblakeman-2121264-making-google-behave-techniques-better-results/ and on Slideshare at http://www.slideshare.net/KarenBlakeman/making-google-behave-techniques-for-better-results

Ten search tips from the trenches. Makes a very nice cheat sheet.

April 9, 2014

Revealing the Uncommonly Common…

Filed under: Algorithms,ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:34 pm

Revealing the Uncommonly Common with Elasticsearch by Mark Harwood.

From the summary:

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly tagged movies and the UK’s most unexpected hotspot for weapon possession.
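The post centers on Elasticsearch’s significant_terms aggregation. A minimal query, assuming a hypothetical “crimes” index with “force” and “crime_type” fields (the names are mine, not from the post), might look like:

```python
import requests

query = {
    "size": 0,
    "query": {"match": {"force": "metropolitan police"}},
    "aggs": {
        "unusual_crime_types": {
            "significant_terms": {"field": "crime_type"}
        }
    },
}
resp = requests.post("http://localhost:9200/crimes/_search", json=query)
for bucket in resp.json()["aggregations"]["unusual_crime_types"]["buckets"]:
    # High scores mark the "uncommonly common": terms frequent in this
    # subset relative to the index as a whole.
    print(bucket["key"], bucket["doc_count"], bucket["score"])
```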

Makes me curious about the market for a “Mr./Ms. Normal” service?

A service that enables you to enter your real viewing/buying/entertainment preferences and, for a fee, generates a paper trail that hides your real habits in digital dust.

If you order porn from NetFlix then the “Mr./Ms. Normal” service will order enough PBS and NatGeo material to even out your renting record.

Depending on how extreme your buying habits happen to be, you may need a “Mr./Ms. Abnormal” service that shields you from any paper trail at all.

As data surveillance grows, having a pre-defined Mr./Ms. Normal/Abnormal account may become a popular high school/college graduation or even a wedding present.

The usefulness of data surveillance depends on the cooperation of its victims. Have you ever considered not cooperating? But appearing to?

March 26, 2014

Elasticsearch 1.1.0,…

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 7:00 pm

Elasticsearch 1.1.0, 1.0.2 and 0.90.13 released by Clinton Gormley.

From the post:

Today we are happy to announce the release of Elasticsearch 1.1.0, based on Lucene 4.7, along with bug fix releases Elasticsearch 1.0.2 and Elasticsearch 0.90.13:

You can download them and read the full changes list here:

New features in 1.1.0

Elasticsearch 1.1.0 is packed with new features: better multi-field search, the search templates and the ability to create aliases when creating an index manually or with a template. In particular, the new aggregations framework has enabled us to support more advanced analytics: the cardinality agg for counting unique values, the significant_terms agg for finding uncommonly common terms, and the percentiles agg for understanding data distribution.

We will be blogging about all of these new features in more detail, but for now we’ll give you a taste of what each feature adds:

….
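To give a flavor of the new aggregations, here is a hedged sketch of the cardinality and percentiles aggs against a hypothetical “logs” index (the index and field names are mine, not from the release post):

```python
import requests

query = {
    "size": 0,
    "aggs": {
        "unique_users": {"cardinality": {"field": "user_id"}},
        "load_times": {"percentiles": {"field": "load_time_ms"}},
    },
}
resp = requests.post("http://localhost:9200/logs/_search", json=query)
aggs = resp.json()["aggregations"]
print(aggs["unique_users"]["value"])  # approximate distinct count (HyperLogLog++)
print(aggs["load_times"]["values"])   # 1st/5th/25th/50th/75th/95th/99th percentiles
```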

Well, there goes the rest of the week! 😉

March 24, 2014

Google Search Appliance and Libraries

Using Google Search Appliance (GSA) to Search Digital Library Collections: A Case Study of the INIS Collection Search by Dobrica Savic.

From the post:

In February 2014, I gave a presentation at the conference on Faster, Smarter and Richer: Reshaping the library catalogue (FSR 2014), which was organized by the Associazione Italiana Biblioteche (AIB) and Biblioteca Apostolica Vaticana in Rome, Italy. My presentation focused on the experience of the International Nuclear Information System (INIS) in using Google Search Appliance (GSA) to search digital library collections at the International Atomic Energy Agency (IAEA). 

Libraries are facing many challenges today. In addition to diminished funding and increased user expectations, the use of classic library catalogues is becoming an additional challenge. Library users require fast and easy access to information resources, regardless of whether the format is paper or electronic. Google Search, with its speed and simplicity, has established a new standard for information retrieval which did not exist with previous generations of library search facilities. Put in a position of David versus Goliath, many small, and even larger libraries, are losing the battle to Google, letting many of its users utilize it rather than library catalogues.

The International Nuclear Information System (INIS)

The International Nuclear Information System (INIS) hosts one of the world's largest collections of published information on the peaceful uses of nuclear science and technology. It offers on-line access to a unique collection of 3.6 million bibliographic records and 483,000 full texts of non-conventional (grey) literature. This large digital library collection suffered from most of the well-known shortcomings of the classic library catalogue. Searching was complex and complicated, it required training in Boolean logic, full-text searching was not an option, and response time was slow. An opportune moment to improve the system came with the retirement of the previous catalogue software and the adoption of Google Search Appliance (GSA) as an organization-wide search engine standard.
….

To be completely honest, my first reaction wasn’t a favorable one.

But even the complete blog post does not do justice to the project in question.

Before reaching an opinion, take a look at the slides, which include screenshots of the new interface.

Take this as a lesson on what your search interface should be offering by default.

There are always other screens you can fill with advanced features.

March 21, 2014

Elasticsearch: The Definitive Guide

Filed under: ElasticSearch,Indexing,Search Engines,Searching — Patrick Durusau @ 5:52 pm

Elasticsearch: The Definitive Guide (Draft)

From the Preface, who should read this book:

This book is for anybody who wants to put their data to work. It doesn’t matter whether you are starting a new project and have the flexibility to design the system from the ground up, or whether you need to give new life to a legacy system. Elasticsearch will help you to solve existing problems and open the way to new features that you haven’t yet considered.

This book is suitable for novices and experienced users alike. We expect you to have some programming background and, although not required, it would help to have used SQL and a relational database. We explain concepts from first principles, helping novices to gain a sure footing in the complex world of search.

The reader with a search background will also benefit from this book. Elasticsearch is a new technology which has some familiar concepts. The more experienced user will gain an understanding of how those concepts have been implemented and how they interact in the context of Elasticsearch. Even in the early chapters, there are nuggets of information that will be useful to the more advanced user.

Finally, maybe you are in DevOps. While the other departments are stuffing data into Elasticsearch as fast as they can, you’re the one charged with stopping their servers from bursting into flames. Elasticsearch scales effortlessly, as long as your users play within the rules. You need to know how to setup a stable cluster before going into production, then be able to recognise the warning signs at 3am in the morning in order to prevent catastrophe. The earlier chapters may be of less interest to you but the last part of the book is essential reading — all you need to know to avoid meltdown.

I fully understand the need, nay, compulsion for an author to say that everyone who is literate needs to read their book. And, if you are not literate, their book is a compelling reason to become literate! 😉

As the author of a book (two editions) and more than one standard, I can assure you an author’s need to reach everyone serves no one very well.

Potential readers range from novices to intermediate users to experts.

A book that targets all three will “waste” space on material already known to experts but not to novices and/or intermediate users.

At the same time, space in a physical book being limited, some material relevant to the expert will be left out altogether.

I had that experience quite recently when the details of LukeRequestHandler (Solr) were described as:

Reports meta-information about a Solr index, including information about the number of terms, which fields are used, top terms in the index, and distributions of terms across the index. You may also request information on a per-document basis.

That’s it. Out of 600+ pages of text, that is all the information you will find on the LukeRequestHandler.

Fortunately I did find: https://wiki.apache.org/solr/LukeRequestHandler.
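If you hit the same gap, the handler is easy to poke at directly. A quick sketch, assuming a local Solr 4.x instance with a core named “collection1” (both assumptions mine):

```python
import requests

resp = requests.get(
    "http://localhost:8983/solr/collection1/admin/luke",
    params={"numTerms": 10, "wt": "json"},  # top 10 terms per field
)
info = resp.json()
print(info["index"]["numDocs"])        # document count
for field, details in info["fields"].items():
    print(field, details.get("type"))  # field name and schema type
```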

I don’t fault the author because several entire books could be written with the material they left out.

That is the hardest part of authoring, knowing what to leave out.

PS: Having said all that, I am looking forward to reading Elasticsearch: The Definitive Guide as it develops.

March 14, 2014

Apache MarkMail

Filed under: Indexing,MarkLogic,Searching — Patrick Durusau @ 6:56 pm

Apache MarkMail

Just in case you don’t have your own index of the 10+ million messages in Apache mailing list archives, this is the site for you.

😉

I ran across it today while debugging an error in a Solr config file.

If I could add one thing to MarkMail it would be software release date facets. Posts are not limited by release dates but I suspect a majority of posts between release dates are about the current release. Enough so that I would find it a useful facet.

You?

