RegexBuddy is your perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on actual data. Debug without guesswork by stepping through the actual matching process. Use the regex with source code snippets automatically adjusted to the particulars of your programming language. Collect and document libraries of regular expressions for future reuse. GREP (search-and-replace) through files and folders. Integrate RegexBuddy with your favorite searching and editing tools for instant access.
I was reminded of RegexBuddy when I stumbled on the RegexBuddy Manual in a search result.
The XQuery/XPath regex treatment is far briefer than I would like but at 500+ pages, it’s an impressive bit of work. Even without a copy of RegexBuddy, working through the examples will make you a regex terrorist.
The only unfortunate aspect, for *nix users, is that you need to run RegexBuddy in a Windows VM. 🙁
If you are comfortable with Emacs, Windows or otherwise, then the Occur mode comes to mind. It doesn’t have the visuals of RegexBuddy but then you are accustomed to a power-user environment.
I’ve seen the HTML in question. The description seems a bit generous to me. 😉
Try your hand at regexes and see if your productivity increases!
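For a feel of what Occur mode does, here is a minimal sketch in Python: list every line that matches a regex, with its line number. The function name `occur` and the sample text are mine for illustration, not anything from Emacs or RegexBuddy.

```python
import re


def occur(pattern, text):
    """Return (line_number, line) pairs for every line the regex matches."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Hypothetical sample text: show the lines that contain digits.
sample = "alpha 42\nbeta\ngamma 7\n"
for number, line in occur(r"\d+", sample):
    print(f"{number}: {line}")
```

No visuals, but it is enough to start testing a regex against your own sample strings.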
Posted in Regexes, Searching | Comments Off on RegexBuddy (Think Occur Mode for Emacs)
I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.
For instance, in line 39, the search is “tumour bed boost.sh.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed boost.sh,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.
In line 42, the search is “Breast Neoplasms /rt.sh” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].
In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse events.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.
In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].
Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.
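The kinds of errors described in the comment can be caught mechanically before a strategy is published. What follows is a rough sketch in Python, assuming only the two patterns named above (field codes joined by periods, and a slash-style subject heading that also carries .sh). It is a heuristic of my own, not an Ovid parser, and the function name `lint_ovid_line` is hypothetical.

```python
import re

# Assumption: field codes are two lowercase letters, as in .sh,kw,ti,ab
DOTTED_FIELDS = re.compile(r"(\.[a-z]{2}){2,}\.?$")


def lint_ovid_line(line):
    """Return a list of suspected syntax problems for one search line."""
    problems = []
    text = line.strip()
    # e.g. "tumour bed boost.sh.kw.ti.ab" instead of ".sh,kw,ti,ab"
    if DOTTED_FIELDS.search(text):
        problems.append("field codes separated by periods; expected commas")
    # e.g. "Breast Neoplasms /rt.sh" or "Radiotherapy/.sh"
    if "/" in text and text.rstrip(".").endswith(".sh"):
        problems.append("subject heading with '/' also carries a .sh suffix")
    return problems


for candidate in ["tumour bed boost.sh.kw.ti.ab", "Breast Neoplasms /rt.sh",
                  "Breast Neoplasms/rt"]:
    print(candidate, "->", lint_ovid_line(candidate) or "no problems found")
```

A clean bill of health from a check like this proves nothing about the search logic, only that the lines named in the comment would no longer trip on these two mistakes.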
A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.
It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.
That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.
When implementing Learning to Rank you need to:
Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
Train a model that can accurately map features to a relevance score
Deploy the model to your search infrastructure, using it to rank search results in production
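To make the four steps concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the judgment list is made up, the two features (a title-match score and a recency score) are hypothetical, and a plain linear model trained by gradient descent stands in for whatever model a team would actually train. It is not how the plugin works.

```python
def train_linear_model(judgments, epochs=2000, rate=0.05):
    """Fit weights so that features map to a relevance grade (pointwise)."""
    weights = [0.0] * len(judgments[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, grade in judgments:
            predicted = bias + sum(w * x for w, x in zip(weights, features))
            error = predicted - grade
            bias -= rate * error
            weights = [w - rate * error * x for w, x in zip(weights, features)]
    return weights, bias


def score(model, features):
    """Apply the trained model to one document's feature vector."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, features))


# Steps 1 and 2 (made-up data): (title_match, recency) -> grade 0, 1 or 2
judgments = [([1.0, 0.9], 2), ([0.8, 0.2], 1), ([0.1, 0.8], 0), ([0.0, 0.1], 0)]

# Step 3: train. Step 4: use the model to order candidate documents.
model = train_linear_model(judgments)
candidates = {"doc_a": [0.9, 0.5], "doc_b": [0.1, 0.9]}
ranked = sorted(candidates, key=lambda d: score(model, candidates[d]), reverse=True)
print(ranked)
```

The hard parts the authors warn about (collecting trustworthy judgments, choosing features) are exactly the parts this sketch fakes.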
Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.
… (emphasis in original)
A great post as always but of particular interest for topic map fans is this passage:
Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)
Do you read this as suggesting the merging exhibited to users should depend upon their queries?
That two or more users, with different query histories could (should?) get different merged results from the same topic map?
Now that’s an interesting suggestion!
Enjoy this post and follow the blog for more of same.
(I have a copy of Relevant Search waiting to be read so I had better get to it!)
The Egyptological museum search is a PHP tool aimed to facilitate locating the descriptions and images of ancient Egyptian objects in online catalogues of major museums. Online catalogues (ranging from selections of highlights to complete digital inventories) are now offered by almost all major museums holding ancient Egyptian items and have become indispensable in research work. Yet the variety of web interfaces and of search rules may overstrain any person performing many searches in different online catalogues.
Egyptological museum search was made to provide a single search point for finding objects by their inventory numbers in major collections of Egyptian antiquities that have online catalogues. It tries to convert user input into search queries recognised by museums’ websites. (Thus, for example, stela Geneva D 50 is searched as “D 0050,” statue Vienna ÄS 5046 is searched as “AE_INV_5046,” and coffin Turin Suppl. 5217 is searched as “S. 05217.”) The following online catalogues are supported:
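The tool itself is PHP and I have not read its source, so take the following as a Python sketch of the idea only. The three rules are inferred from nothing more than the three examples quoted above, and the function name `to_museum_query` is hypothetical.

```python
import re


def to_museum_query(museum, inventory):
    """Rewrite an inventory number into the form a museum's catalogue expects."""
    match = re.search(r"\d+", inventory)
    if match is None:
        raise ValueError(f"no number found in {inventory!r}")
    number = int(match.group())
    if museum == "Geneva":
        # Assumed rule: keep the letter prefix, pad the number to four digits.
        prefix = inventory.split()[0]
        return f"{prefix} {number:04d}"
    if museum == "Vienna":
        # Assumed rule: replace the collection prefix with AE_INV_.
        return f"AE_INV_{number}"
    if museum == "Turin":
        # Assumed rule: abbreviate Suppl. to S. and pad to five digits.
        return f"S. {number:05d}"
    raise ValueError(f"no rule for museum {museum!r}")


print(to_museum_query("Geneva", "D 50"))
print(to_museum_query("Vienna", "ÄS 5046"))
print(to_museum_query("Turin", "Suppl. 5217"))
```

Each rule is a small mapping from how people cite an object to how one catalogue indexes it, which is why a single search point saves so much effort.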
How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!
After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.
That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!
The free ebook? Guide to Relevance Investments.
I know, I know, the title is an interest killer.
Think Search ROI. Not something you hear about often but it sounds attractive.
Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.
Search and investment in search is a business decision and this guide nudges you in that direction.
What you do next is up to you.
Posted in Searching | Comments Off on Guide to Making Search Relevance Investments, free ebook
WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.
long enough for them to be decried as “interference” with the U.S. presidential election.
You have two search options, basic:
As handy as these search interfaces are, you cannot easily:
Analyze relationships between multiple senders and/or recipients of emails
Perform entity recognition across the emails as a corpus
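The first of those gaps is easy to start closing once you have the raw messages. A minimal sketch in Python, assuming you hold each email as an RFC 822 style string; the addresses below are made up and the function name `correspondence_counts` is mine.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def correspondence_counts(raw_messages):
    """Count how often each (sender, recipient) pair occurs across messages."""
    pairs = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, sender in senders:
            for _, recipient in recipients:
                pairs[(sender.lower(), recipient.lower())] += 1
    return pairs


# Made-up messages for illustration only.
messages = [
    "From: Alice <alice@example.org>\nTo: bob@example.org, Carol <carol@example.org>\n\nhello",
    "From: alice@example.org\nTo: bob@example.org\nCc: dave@example.org\n\nagain",
]
for pair, count in correspondence_counts(messages).most_common():
    print(pair, count)
```

The resulting pair counts are the edge list of a sender/recipient graph, which is where the relationship analysis would begin.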
The dump contains a set of exploits, implants and tools for hacking firewalls (“Firewall Operations”). This post aims to be a comprehensive list of all the tools contained or referenced in the dump.
Mustafa’s post is a great illustration of why “being found beats searching, every time.”
Think of the cycles you would have to spend to duplicate this list. Multiply that by the number of people interested in this list. Assuming their time is not valueless, do you start to see the value-add of Mustafa’s post?
Mustafa found each of these items in the data dump and then preserved his finding for the use of others.
It’s not a very big step beyond this preservation to the creation of a container for each of these items, enabling the preservation of other material found on them or related to them.
Search is a starting place and not a destination.
Unless you enjoy repeating the same finding process over and over again.
Just to show I don’t spend too much time thinking of ways to gnaw on the ankles of Süddeutsche Zeitung (SZ), the hoarders of the Panama Papers, here is my experience with visual searching with Google today.
I saw this image on Twitter:
I assumed that cutting the “clutter” from around the cluster might produce a better result. Besides, the plastic separators looked (to me) to be standard and not custom made.
Here is my cropped image for searching:
Google responded this looks like: “water.” 😉
OK, so I tried cropping it more just to show the ports, thinking that might turn up similar port arrangements, here’s that image:
Google says: “machinery.” With a number of amusing “similar” images.
BTW, when I tried the full image, the first one, Google says: “electronics.”
OK, so much for Google image searching. What if I try?
Searching on neo4j cluster and raspberry pi (the most likely suspect), my first “hit” had this image:
Reposting the image and links to the posts cited above will help disappoint the mislaid plans to suppress it.
What is more amazing than the chancellor thinking information on the Internet can be suppressed, at least for a paltry $175K, is this pattern will be repeated year after year.
Lying about information known to others is a losing strategy, always.
But that strategy will be picked up by other universities, governments and their agencies, corporations, to say nothing of individuals.
Had UC Davis spent that $175K on better training for its police officers, people would still talk about this event but it would be in contrast to the new and improved way UC Davis deals with protesters.
That’s not likely to happen now.
Posted in Politics, Searching | Comments Off on UC Davis Spent $175,000.00 To Suppress This Image (let’s disappoint them)
Here is a rough visualization of the information loss if you use only one of those terms:
If you search for “duplicate detection,” you miss all the articles shaded in blue.
If you search for “coreference resolution,” you miss all the articles shaded in yellow.
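The numbers behind such a visualization are plain set arithmetic. A small sketch in Python, assuming you have the result identifiers for each term from the same search engine; the IDs below are invented and `term_coverage` is a name of my own.

```python
def term_coverage(hits_a, hits_b):
    """Compare two result sets and report what each term alone would miss."""
    a, b = set(hits_a), set(hits_b)
    everything = a | b
    total = len(everything)
    return {
        "missed_if_only_a": b - a,
        "missed_if_only_b": a - b,
        "found_by_both": a & b,
        "loss_rate_a": len(b - a) / total if total else 0.0,
        "loss_rate_b": len(a - b) / total if total else 0.0,
    }


# Invented article IDs for the two terms.
duplicate_detection = {1, 2, 3, 4}
coreference_resolution = {3, 4, 5, 6, 7}
report = term_coverage(duplicate_detection, coreference_resolution)
print(report)
```

The loss rates are the share of all known articles that a searcher using only one term never sees.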
Suggestions for improving this visualization?
It is a visualization that could be performed on a client’s data, using their search engine/database.
In order to identify the data loss they are suffering now from search across departments.
With the caveat that not all data loss is bad and/or worth avoiding.
Imaginary example (so far): What if you could demonstrate no overlapping of terminology for two vendors for the United States Army and the Air Force? That is, no query terms for one returned useful results for the other.
That is a starting point for evaluating the use of topic maps.
While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?
Assuming you can identify such a capacity, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.
I assume the most relevant terms are going to be those internal to customers and/or potential customers.
Interest in working this up into a client prospecting/topic map marketing tool?
Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
Spatial4j has been updated to a new 0.6 version hosted by locationtech.
TermsQuery performance boost by a more aggressive default query caching policy.
IndexSearcher’s default Similarity is now changed to BM25Similarity.
Easier method of defining custom CharTokenizer instances.
Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
Users should set useDocValuesAsStored=”false” to preserve sort order on multi-valued fields that have both stored=”true” and docValues=”true”.
Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-‘. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
The deprecated <nrtMode> configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
SolrClient.shutdown() has been removed, use SolrClient.close() instead.
The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
New features and capabilities added to the streaming API.
Added support for SELECT DISTINCT queries to the SQL interface.
New GraphQuery to enable graph traversal as a query operator.
New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
Filter support added to Real-time get.
Column alias support added to the Parallel SQL Interface.
New command added to switch between non/secure mode in zookeeper.
Now possible to use IP fragments in replica placement rules.
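Since BM25Similarity becomes the default in both lists above, a reminder of what the scoring looks like may help. This is the textbook BM25 formula sketched in Python with the usual k1 = 1.2 and b = 0.75, on a made-up three-document corpus. It is not Lucene's code, which handles document lengths and other details differently.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document for a list of query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    total = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)
        # Rare terms count for more; the +1 keeps the logarithm positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates, and long documents are penalized via b.
        length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        total += idf * tf * (k1 + 1) / (tf + length_norm)
    return total


# Made-up corpus of tokenized documents.
corpus = [
    ["solr", "search", "search"],
    ["solr", "index", "schema"],
    ["lucene", "index", "points"],
]
for doc in corpus:
    print(doc, round(bm25_score(["search", "solr"], doc, corpus), 3))
```

The practical upshot for upgraders: scores and possibly result order will change even though nothing in your queries did.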
Internet search rankings have a significant impact on consumer choices, mainly because users trust and choose higher-ranked results more than lower-ranked results. Given the apparent power of search rankings, we asked whether they could be manipulated to alter the preferences of undecided voters in democratic elections. Here we report the results of five relevant double-blind, randomized controlled experiments, using a total of 4,556 undecided voters representing diverse demographic characteristics of the voting populations of the United States and India. The fifth experiment is especially notable in that it was conducted with eligible voters throughout India in the midst of India’s 2014 Lok Sabha elections just before the final votes were cast. The results of these experiments demonstrate that (i) biased search rankings can shift the voting preferences of undecided voters by 20% or more, (ii) the shift can be much higher in some demographic groups, and (iii) search ranking bias can be masked so that people show no awareness of the manipulation. We call this type of influence, which might be applicable to a variety of attitudes and beliefs, the search engine manipulation effect. Given that many elections are won by small margins, our results suggest that a search engine company has the power to influence the results of a substantial number of elections with impunity. The impact of such manipulations would be especially large in countries dominated by a single search engine company.
I’m not surprised by SEME (search engine manipulation effect).
Although I would probably be more neutral and say: Search Engine Impact on Voting.
Whether you consider one result or another as the result of “manipulation” is a matter of perspective. No search engine strives to deliver “false” information to users.
In the novel 1984, George Orwell imagines a society in which powerful but hidden forces subtly shape peoples’ perceptions of the truth. By changing words, the emphases put on them, and their presentation, the state is able to alter citizens’ beliefs and behaviors in ways of which they are unaware.
Now imagine today’s Internet search engines did just that kind of thing—that subtle biases in search engine results, introduced deliberately or accidentally, could tip elections unfairly toward one candidate or another, all without the knowledge of voters.
That may seem an unlikely scenario, but recent research suggests it is quite possible. Robert Epstein and Ronald E. Robertson, researchers at the American Institute for Behavioral Research and Technology, conducted experiments that showed the sequence of results from politically oriented search queries can affect how users vote, especially among undecided voters, and biased rankings of search results usually go undetected by users. The outcomes of close elections could result from the deliberate tweaking of search algorithms by search engine companies, and such manipulation would be extremely difficult to detect, the experiments suggest.
Gary’s post is a good supplement to the original article, covering some of the volunteers who are ready to defend the rest of us from biased search results.
Or as I would put it, to inject their biases into search results as opposed to other biases they perceive as being present.
If you are more comfortable describing the search results you want presented as “fair and equitable,” etc., please do so but I prefer the honesty of naming biases as such.
Or as David Bowie once said:
Make your desired bias, direction, etc., a requirement and allow data scientists to get about the business of conveying it.
Quite by accident I discovered the relationship between courses and their texts is hidden in many (approx. 2000) campus bookstore interfaces.
If you visit a physical campus bookstore you can browse courses for their textbooks. Very useful if you are interested in the subject but not taking the course.
An online LLM (master’s of taxation) flyer prompted me to check the textbooks for the course work.
A simple enough information request. Find the campus bookstore and browse by course for text listings.
Not so fast!
The online presences of over 1200 campus bookstores are delivered by http://www.bkstr.com/, which offers this interface:
Another 748 campus bookstores are delivered by http://bncollege.com/, with a similar interface for textbooks:
I started this post by saying the relationship between courses and their texts is hidden, but that’s not quite right.
The relationship between a meaningless course number and its required/suggested text is visible, but the identification of a course by a numeric string is hardly meaningful to the casual observer (read: not an enrolled student).
Perhaps better to say that a meaningful identification of courses for non-enrolled students and their relationship to required/suggested texts is absent.
That is, the relationship of course -> text is present, but not in a form meaningful to anyone other than a student in that course.
Considering that two separate vendors across almost 2,000 bookstores deliberately obscure the course -> text relationship, one has to wonder why.
I don’t have any immediate suggestions but when I encounter systematic obscuring of information across vendors, alarm bells start to go off.
Just for completeness’ sake, you can get around the obscuring of the course -> text relationship by searching for “syllabus LLM taxation income OR estate OR corporate” or “(school name) syllabus LLM taxation income OR estate OR corporate.” Extract required/suggested texts from posted syllabi.
PS: If you can offer advice on bookstore interfaces, suggest enabling the browsing of courses by name and linking to the required/suggested texts.
Particular embodiments determine that a textual term is not associated with a known meaning. The textual term may be related to one or more users of the social-networking system. A determination is made as to whether the textual term should be added to a glossary. If so, then the textual term is added to the glossary. Information related to one or more textual terms in the glossary is provided to enhance auto-correction, provide predictive text input suggestions, or augment social graph data. Particular embodiments discover new textual terms by mining information, wherein the information was received from one or more users of the social-networking system, was generated for one or more users of the social-networking system, is marked as being associated with one or more users of the social-networking system, or includes an identifier for each of one or more users of the social-networking system. (emphasis in original)
We’ve all been served up search results we weren’t sure about, whether they were for “the best tacos in town” or “how to tell if your dog has eaten chocolate.” With IBM Patent no. 9087304, you no longer have to second-guess the answers you’re given. This new tech helps cognitive machines find the best potential answers to your questions by thinking critically about the trustworthiness and accuracy of each source. Simply put, these machines can use their own judgment to separate the right information from wrong. (From: http://ibmblr.tumblr.com/post/139624929596/weve-all-been-served-up-search-results-we-werent)
Did you notice that the 1st for 23 years post did not have a single link for any of the patents mentioned?
You would think IBM would be proud enough to link to its new patents and especially 9087304, that “…separate[s] right information from wrong.”
But if you follow the link for 9087304, you get an impression of one reason IBM didn’t include the link.
The abstract for 9087304 reads:
Method, computer program product, and system to perform an operation for a deep question answering system. The operation begins by computing a concept score for a first concept in a first case received by the deep question answering system, the concept score being based on a machine learning concept model for the first concept. The operation then excludes the first concept from consideration when analyzing a candidate answer and an item of supporting evidence to generate a response to the first case upon determining that the concept score does not exceed a predefined concept minimum weight threshold. The operation then increases a weight applied to the first concept when analyzing the candidate answer and the item of supporting evidence to generate the response to the first case when the concept score exceeds a predefined maximum weight threshold.
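Read literally, the abstract describes two threshold tests on a score. Here is my own reading of it as a Python sketch. The threshold values, the boost factor and the function name `concept_weights` are all assumptions, since the abstract gives none of them, and this is not IBM's implementation.

```python
def concept_weights(concept_scores, min_threshold=0.2, max_threshold=0.8, boost=2.0):
    """Drop low-scoring concepts and up-weight high-scoring ones."""
    weights = {}
    for concept, concept_score in concept_scores.items():
        if concept_score <= min_threshold:
            # "does not exceed" the minimum threshold: excluded from consideration
            continue
        # "exceeds" the maximum threshold: weight increased, otherwise unchanged
        weights[concept] = boost if concept_score > max_threshold else 1.0
    return weights


# Invented concept scores for one question.
scores = {"chocolate": 0.95, "dog": 0.5, "town": 0.1}
print(concept_weights(scores))
```

Ten lines, give or take.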
I will spare you further recitations from the patent.
Show of hands, do U.S. Patents always require:
#2 but not #1
Judge rankings by # of patents granted accordingly.
By its very nature, breaking news happens unexpectedly. Simply waiting for something to start trending on Twitter is not an option for journalists – you’ll have to actively seek it out.
The most important rule is to switch perspectives with the eyewitness and ask yourself, “What would I tweet if I were an eyewitness to an accident or disaster?”
To find breaking news on Twitter you have to think like a person who’s experiencing something out of the ordinary. Eyewitnesses tend to share what they see unfiltered and directly on social media, usually by expressing their first impressions and feelings. Eyewitness media can include very raw language that reflects the shock felt as a result of the situation. These posts often include misspellings.
In this article, we’ll outline some search terms you can use in order to find breaking news. The list is not intended as exhaustive, but a starting point on which to build and refine searches on Twitter to find the latest information.
Great collections of starter search terms but those are going to vary depending on your domain of “breaking” news.
Good illustration of use of Twitter search operators.
Google is, beyond question, the most utilized and highest performing search engine on the web. However, most of the users who utilize Google do not maximize their potential for getting the most accurate results from their searches.
By using Google Search Operators, you can find exactly what you are looking for quickly and effectively just by changing what you input into the search bar.
If you are searching for something simple on Google like [Funny cats] or [Francis Ford Coppola Movies] there is no need to use search operators. Google will return the results you are looking for effectively no matter how you input the words.
Note: Throughout this article whatever is in between these brackets [ ] is what is being typed into Google.
When [Francis Ford Coppola Movies] is typed into Google, Google reads the query as Francis AND Ford AND Coppola AND Movies. So Google will return pages that have all those words in them, with the most relevant pages appearing first. Which is fine when you’re searching for very broad things, but what if you’re trying to find something specific?
What happens when you’re trying to find a report on the revenue and statistics from the United States National Park System in 1995 from a reliable source, and no using Wikipedia.
I can’t say that Marcela’s guide is comprehensive for Google in 2016, because I am guessing the post was written in 2013. Hard to say if early or late 2013 without more research than I am willing to donate. Dating posts makes it easy for readers to spot current or past-use-date information.
For the information that is present, this is a great presentation and list of operators.
One way to use this post is to work through every example but use terms from your domain.
If you are mining the web for news reporting, compete against yourself on successive stories or within a small group.
Great resource for creating a search worksheet for classes.
These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages to accomplish tasks ranging from searches to archiving content to statistical analyses and so on. Web crawling is a task that has been automated by developers in every programming language around, many times — for example, a search for web crawling source code yields well over a million hits.
So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)
But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling code is Scrapy, which is only available for Python 2. I found a few Python 3 samples, but they all seemed to be either too trivial (not avoiding re-scanning the same page, for example) or too needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.
In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well.
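I have not reproduced Doug's code here. For readers who want the shape of the problem first, this is a minimal Python 3 sketch of my own using only the standard library. It keeps a set of seen URLs so no page is scanned twice, which was one of his complaints about the trivial samples. The names `crawl`, `fetch_html` and `LinkParser` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl from start_url, visiting each URL at most once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing your own `fetch` function makes it easy to try against canned pages before turning it loose on a live site. A real crawler would also honor robots.txt and rate limits, which this sketch does not.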
Doug has been writing publicly about his hip replacement surgery so I don’t think this has any privacy issues. 😉
I am interested to see what he writes once he is fully recovered.
My contacts at the American Medical Association disavow any knowledge of hip replacement surgery driving patients to write in Python and/or to write web crawlers.
I suppose there could be liability implications, especially for C/C++ programmers who lose their programming skills except for Python following such surgery.
Still, glad to hear Doug has been making great progress and hope that it continues!