Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 15, 2013

Learning the meaning behind words

Filed under: Machine Learning,Meaning — Patrick Durusau @ 6:55 pm

Learning the meaning behind words by Tomas Mikolov, Ilya Sutskever, and Quoc Le, Google Knowledge.

From the post:

Today computers aren’t very good at understanding human language, and that forces people to do a lot of the heavy lifting—for example, speaking “searchese” to find information online, or slogging through lengthy forms to book a trip. Computers should understand natural language better, so people can interact with them more easily and get on with the interesting parts of life.

While state-of-the-art technology is still a ways from this goal, we’re making significant progress using the latest machine learning and natural language processing techniques. Deep learning has markedly improved speech recognition and image classification. For example, we’ve shown that computers can learn to recognize cats (and many other objects) just by observing large amount of images, without being trained explicitly on what a cat looks like. Now we apply neural networks to understanding words by having them “read” vast quantities of text on the web. We’re scaling this approach to datasets thousands of times larger than what has been possible before, and we’ve seen a dramatic improvement of performance — but we think it could be even better. To promote research on how machine learning can apply to natural language problems, we’re publishing an open source toolkit called word2vec that aims to learn the meaning behind words.

Word2vec uses distributed representations of text to capture similarities among concepts. For example, it understands that Paris and France are related the same way Berlin and Germany are (capital and country), and not the same way Madrid and Italy are. This chart shows how well it can learn the concept of capital cities, just by reading lots of news articles — with no human supervision:
(…)
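
The analogy trick is easy to picture with a toy example. The sketch below uses made-up vectors and plain numpy, not the word2vec toolkit itself, but it shows the vector-offset idea: subtracting "France" from "Paris" and adding "Germany" should land nearest "Berlin".

  import numpy as np

  # Hypothetical 4-dimensional embeddings, for illustration only; real
  # word2vec vectors have hundreds of dimensions learned from text.
  vectors = {
      "paris":   np.array([0.9, 0.1, 0.8, 0.2]),
      "france":  np.array([0.8, 0.1, 0.1, 0.2]),
      "berlin":  np.array([0.2, 0.9, 0.8, 0.1]),
      "germany": np.array([0.1, 0.9, 0.1, 0.1]),
      "madrid":  np.array([0.5, 0.4, 0.9, 0.3]),
  }

  def cosine(a, b):
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  # paris - france + germany should land nearest berlin
  target = vectors["paris"] - vectors["france"] + vectors["germany"]
  best = max((w for w in vectors if w not in ("paris", "france", "germany")),
             key=lambda w: cosine(vectors[w], target))
  print(best)  # berlin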

Google has open sourced the code for word2vec.

I wonder how this would perform on all the RFCs?

Or all of the papers at Citeseer?

RE|Parse

Filed under: Parsing,Python — Patrick Durusau @ 6:45 pm

RE|PARSE

From the webpage:

Python library/tools for combining and using Regular Expressions in a maintainable way

This library also allows you to:

  • Maintain a database of Regular Expressions
  • Combine them together using Patterns
  • Search, Parse and Output data matched by combined Regex using Python functions.

If you know Regular Expressions already, this library basically just gives you a way to combine them together and hook them up to some callback functions in Python.
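
The core idea is easy to sketch with nothing but the standard re module. The snippet below is not RE|PARSE's actual API, just a rough illustration of keeping a small "database" of named expressions, combining them into one pattern, and dispatching matches to Python callbacks.

  import re

  # A small "database" of named expressions, each paired with a callback.
  patterns = {
      "email": (r"[\w.+-]+@[\w-]+\.[\w.]+", lambda m: ("email", m.lower())),
      "phone": (r"\d{3}-\d{3}-\d{4}", lambda m: ("phone", m.replace("-", ""))),
  }

  # Combine them into a single pattern with named groups.
  combined = re.compile("|".join("(?P<%s>%s)" % (name, rx)
                                 for name, (rx, _) in patterns.items()))

  def parse(text):
      for match in combined.finditer(text):
          name = match.lastgroup  # which named pattern matched
          yield patterns[name][1](match.group(name))

  print(list(parse("Mail alice@example.com or call 555-867-5309")))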

This looks like a very useful tool.

Search Humor

Filed under: Humor,Searching — Patrick Durusau @ 3:38 pm

I saw a tweet today by nickbarnwell:

Pro-tip: “Hickey Facts” and “Hickey Facts Datomic” turn up vastly different search results #clojure #datomic

Don’t take my word for it:

Hickey Facts

Hickey Facts Datomic

In case you don’t already know the answer, the first query returns “about” 1,400,000 results and the second query “about” 12,000 results. 😉

Five tips for Delivering a Presentation

Filed under: Communication,Marketing — Patrick Durusau @ 3:29 pm

Five tips for Delivering a Presentation by Hugh E. Williams.

From the post:

I wrote a few weeks ago on writing a presentation. This week, I offer a few thoughts on delivering one – in no particular order. I’m working on my sequel to my post on performance reviews — expect it next week!

Hugh covers:

  1. Eye Contact
  2. Body Language
  3. Don’t Read Notes (or Memorize)
  4. Don’t Read Slides
  5. It’s (almost) Impossible to Speak Too Slowly

Same top tips that we covered in my first speech class in high school.

Well, except for the one about reading slides. 😉 It would have been:

Don’t Read the Overhead Slides.

Technology has changed but poor presenting has not.

The same is true for poor sales technique.

Such as trying to sell customers what you are interested in selling, not what the customer is interested in buying.

That sounds like a bad plan to me.

August 14, 2013

Glottolog

Filed under: Language — Patrick Durusau @ 6:46 pm

Glottolog

From the webpage:

Comprehensive reference information for the world’s languages, especially the lesser known languages.

Information about the different languages, dialects, and families of the world (‘languoids’) is available in the Languoid section. The Langdoc section contains bibliographical information. (…)

Languoid catalogue

Glottolog provides a comprehensive catalogue of the world’s languages, language families and dialects. It assigns a unique and stable identifier (the Glottocode) to (in principle) all languoids, i.e. all families, languages, and dialects. Any variety that a linguist works on should eventually get its own entry. The languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research (see also the Languoids information section).

Langdoc

Langdoc is a comprehensive collection of bibliographical data for the world’s lesser known languages. It provides access to more than 180,000 references of descriptive works such as grammars, dictionaries, word lists, texts etc. Search criteria include author, year, title, country, and genealogical affiliation. References can be downloaded as txt, bib, html, or with the Zotero Firefox plugin.

Interesting language resource.

The authors are interested in additional bibliographies in any format: glottolog@eva.mpg.de

Global map of protests: 2013 so far

Filed under: Government,News — Patrick Durusau @ 3:50 pm

Global map of protests: 2013 so far

From the post:

We know that 2011 was the year of revolution in the Arab world, but how is 2013 shaping up so far? The Global Database of Events pulls together local, national and international news sources and codes them to identify all types of protest from collecting signatures to conducting hunger strikes to rioting.

Mapping the protests that took place in the first six months of 2013 isn’t perfectly accurate because we don’t know how many individuals took part but it does provide an insight into political action around the world.
Click on a protest below to see when it took place and how many times it was mentioned in the press.

• Who made this? John Beieler, Ph.D. Student at Pennsylvania State University with the help of Josh Stevens.

See the GDELT Event Database homepage.

From that homepage:

The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first “realtime social sciences earth observatory.” Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates.

GDELT is designed to help support new theories and descriptive understandings of the behaviors and driving forces of global-scale social systems from the micro-level of the individual through the macro-level of the entire planet by offering realtime synthesis of global societal-scale behavior into a rich quantitative database allowing realtime monitoring and analytical exploration of those trends.

GDELT’s goal is to help uncover previously-obscured spatial, temporal, and perceptual evolutionary trends through new forms of analysis of the vast textual repositories that capture global societal activity, from news and social media archives to knowledge repositories.

Explore the other uses of GDELT while you are at the site.

I saw the post in the Guardian.

Intrinsic vs. Extrinsic Structure

Filed under: Data,Data Structures — Patrick Durusau @ 2:50 pm

Intrinsic vs. Extrinsic Structure by Jesse Johnson.

From the post:

At this point, I think it will be useful to introduce an idea from geometry that is very helpful in pure mathematics, and that I find helpful for understanding the geometry of data sets. This idea is the difference between the intrinsic structure of an object (such as a data set) and its extrinsic structure. Have you ever gone into a building, walked down a number of different halls and through different rooms, and when you finally got to where you’re going and looked out the window, you realized that you had no idea which direction you were facing, or which side of the building you were actually on? The intrinsic structure of a building has to do with how the rooms, halls and staircases connect up to each other. The extrinsic structure is how these rooms, halls and staircases sit with respect to the outside world. So, while you’re inside the building you may be very aware of the intrinsic structure, but completely lose track of the extrinsic structure.

You can see a similar distinction with subway maps, such as the famous London tube map. This map records how the different tube stops connect to each other, but it distorts how the stops sit within the city. In other words, the coordinates on the tube map do not represent the physical/GPS coordinates of the different stops. But while you’re riding a subway, the physical coordinates of the different stops are much less important than the inter-connectivity of the stations. In other words, the intrinsic structure of the subway is more important (while you’re riding it) than the extrinsic structure. On the other hand, if you were walking through a city, you would be more interested in the extrinsic structure of the city since, for example, that would tell you the distance in miles (or kilometers) between you and your destination.

Data sets also have both intrinsic and extrinsic structure, though there isn’t a sharp line between where the intrinsic structure ends and the extrinsic structure begins. These are more intuitive terms than precise definitions. In the figure below, which shows three two-dimensional data sets, the set on the left has an intrinsic structure very similar to that of the middle data set: Both have two blobs of data points connected by a narrow neck of data points. However, in the data set on the left the narrow neck forms a roughly straight line. In the center, the tube curves around, so that the entire set roughly follows a circle.
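
Jesse's figure is easy to mock up. The numpy sketch below (my own toy construction, not Jesse's code) builds two 2-D data sets with the same intrinsic structure, two blobs joined by a narrow neck, one with the neck straight and one with it bent along a half-circle.

  import numpy as np

  rng = np.random.RandomState(0)

  def blob(center, n=100, spread=0.3):
      return rng.normal(loc=center, scale=spread, size=(n, 2))

  # Straight version: blobs at x=0 and x=4, neck along the x-axis.
  x = rng.uniform(0, 4, size=(60, 1))
  straight = np.vstack([blob([0, 0]), blob([4, 0]),
                        np.hstack([x, rng.normal(0, 0.05, size=(60, 1))])])

  # Curved version: same blobs and neck, but the neck bends along a half-circle.
  theta = rng.uniform(0, np.pi, size=60)
  arc = np.column_stack([2 + 2 * np.cos(theta), 2 * np.sin(theta)])
  arc = arc + rng.normal(0, 0.05, size=(60, 2))
  curved = np.vstack([blob([4, 0]), blob([0, 0]), arc])

  print(straight.shape, curved.shape)  # (260, 2) (260, 2)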

I am looking forward to this series of posts from Jesse.

Bearing in mind that the structure of a data set is impacted by collection goals, methods, and other factors.

Matters that are not (usually) represented in the data set per se.

Social Remains Isolated From ‘Business-Critical’ Data

Filed under: Data Integration,Data Silos,Social Media — Patrick Durusau @ 2:29 pm

Social Remains Isolated From ‘Business-Critical’ Data by Aarti Shah.

From the post:

Social data — including posts, comments and reviews — are still largely isolated from business-critical enterprise data, according to a new report from the Altimeter Group.

The study considered 35 organizations — including Caesar’s Entertainment and Symantec — that use social data in context with enterprise data, defined as information collected from CRM, business intelligence, market research and email marketing, among other sources. It found that the average enterprise-class company owns 178 social accounts and 13 departments — including marketing, human resources, field sales and legal — are actively engaged on social platforms.

“Organizations have invested in social media and tools are consolidating but it’s all happening in a silo,” said Susan Etlinger, the report’s author. “Tools tend to be organized around departments because that’s where budgets live…and the silos continue because organizations are designed for departments to work fairly autonomously.”

Somewhat surprisingly, the report finds social data is often difficult to integrate because it is touched by so many organizational departments, all with varying perspectives on the information. The report also notes the numerous nuances within social data make it problematic to apply general metrics across the board and, in many organizations, social data doesn’t carry the same credibility as its enterprise counterpart. (emphasis added)

Isn’t the definition of a silo the organization of data from a certain perspective?

If so, why would it be surprising that different views on data make it difficult to integrate?

Viewing data from one perspective isn’t the same as viewing it from another perspective.

Not really a question of integration but of how easy/hard it is to view data from a variety of equally legitimate perspectives.

Rather than a quest for “the” view shouldn’t we be asking users: “What view serves you best?”

Building a panopticon:…

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 8:14 am

Building a panopticon: The evolution of the NSA’s XKeyscore by Sean Gallagher.

From the post:

The National Security Agency’s (NSA) apparatus for spying on what passes over the Internet, phone lines, and airways has long been the stuff of legend, with the public catching only brief glimpses into its Leviathan nature. Thanks to the documents leaked by former NSA contractor Edward Snowden, we now have a much bigger picture.

When that picture is combined with federal contract data and other pieces of the public record—as well as information from other whistleblowers and investigators—it’s possible to deduce a great deal about what the NSA has built and what it can do.

We’ve already looked at the NSA’s basic capabilities of collecting, managing, and processing “big data.” But the recently released XKeyscore documents provide a much more complete picture of how the NSA feeds its big data monsters and how it gets “situational awareness” of what’s happening on the Internet. What follows is an analysis of how XKeyscore works and how the NSA’s network surveillance capabilities have evolved over the past decade.

Your mother always told you as a child that someone was watching over you.

Just change that to “someone is watching you,” and she was right. 😉

It is a good summary of the probable capabilities of the NSA for scooping up packets.

I do find it ironic that the NSA wanted to keep the documents secret forever and the Guardian wants to keep some part of them secret until it has mined as many headlines as possible.

I would not be surprised if Snowden/NSA documents are still being leaked in the next presidential election cycle.

Everybody wants to have secrets, just for different reasons.

PS: Is anyone working on a typology for lies? I thought it could be useful when topic mapping the Snowden/NSA documents with presidential documents and post-Obama self-serving testimonials.

Here is a starter list:

  • Least untruthful lies
  • Damned lies
  • Presidential lies
  • Ordered lies
  • National Security lies
  • Accidental lies?
  • Necessary lies
  • Protect the incompetent lies
  • Protect contractor lies
  • Protect corruption lies
  • What else?

August 13, 2013

Are EigenVectors Dangerous?

Filed under: Graphs,Mathematics,Networks,PageRank,Ranking — Patrick Durusau @ 7:44 pm

neo4j: Extracting a subgraph as an adjacency matrix and calculating eigenvector centrality with JBLAS by Mark Needham.

Mark continues his exploration of Eigenvector centrality by adding the Eigenvector centrality values back to the graph from which they were computed.

Putting the Eigenvector centrality results back into Neo4j makes them easier to query.

What troubles me is that Eigenvector centrality values are based only upon the recorded information we have for the graph.

There is no allowance for missing relationships or any validation of the Eigenvector centrality values found.

Recall that Paul Revere was a “terrorist” in his day. Add the NSA using algorithms to declare nodes “important” and detainees’ lack of access to the courts, and Eigenvector centrality values start to look dangerous.

How would you validate Eigenvector centrality values? Not mathematically but against known values or facts outside of your graph.

How Important is Your Node in the Social Graph?

Filed under: Graphs,Mathematics,Networks,PageRank,Ranking — Patrick Durusau @ 6:08 pm

Java/JBLAS: Calculating eigenvector centrality of an adjacency matrix by Mark Needham.

OK, Mark’s title is more accurate but mine is more likely to get you to look beyond the headline. 😉

From the post:

I recently came across a very interesting post by Kieran Healy where he runs through a bunch of graph algorithms to see whether he can detect the most influential people behind the American Revolution based on their membership of various organisations.

The first algorithm he looked at was betweenness centrality which I’ve looked at previously and is used to determine the load and importance of a node in a graph.

This algorithm would assign a high score to nodes which have a lot of nodes connected to them even if those nodes aren’t necessarily influential nodes in the graph.

If we want to take the influence of the other nodes into account then we can use an algorithm called eigenvector centrality.
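
Mark's code uses JBLAS from Java, but the underlying idea fits in a few lines of numpy: the centrality scores are the principal eigenvector of the adjacency matrix, which power iteration finds. The toy graph below is my sketch, not Mark's example.

  import numpy as np

  # Adjacency matrix of a toy undirected graph; node 2 is connected to everyone.
  A = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

  # Power iteration: repeatedly multiply and renormalize until the vector
  # settles on the principal eigenvector of A.
  x = np.ones(A.shape[0])
  for _ in range(100):
      x = A @ x
      x = x / np.linalg.norm(x)

  print(np.round(x, 3))  # node 2 gets the highest centrality score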

You may remember Kieran Healy’s post from Using Metadata to Find Paul Revere [In a Perfect World], where I pointed out that Kieran was using clean data. No omissions, no variant spellings, no confusion of any sort.

I suspect any sort of analysis would succeed with the proviso that it only gets clean data. Unlikely in an unclean data world.

But that to one side, Mark does a great job of assembling references on eigenvectors and code for processing. Follow all the resources in Mark’s post and you will have a much deeper understanding of this area.

Be sure to take note of the comparison between PageRank and Eigenvector centrality. Results are computational artifacts of choices that are visible when examining the end results.

PS: The Wikipedia link for Centrality cites Opsahl, Tore; Agneessens, Filip; Skvoretz, John (2010). “Node centrality in weighted networks: Generalizing degree and shortest paths“. Social Networks 32 (3): 245. doi:10.1016/j.socnet.2010.03.006 as a good summary. The link for the title leads to a preprint which is freely available.

Of collapsing in Solr

Filed under: Search Engines,Searching,Solr,Topic Maps — Patrick Durusau @ 4:35 pm

Of collapsing in Solr by Paul Masurel.

From the post:

This post is about the inner workings of one of the two most popular open source search engines: Solr. I noticed that many questions (one or two every day) on the solr-user mailing list were about Solr’s collapsing functionality.

I thought it would be a good idea to explain how Solr’s collapsing works, because its documentation is very sparse, and because a search engine is the kind of car you want to take a peek under the hood of to make sure you’ll drive it right.

The Solr documentation at Apache refers to field collapsing and result grouping being “different ways to think about the same Solr feature.”

I read the post along with the Solr documentation.
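
For the impatient, here is a minimal sketch of a result grouping request against a local Solr 4 instance. The core name and the single-valued "category" field are my assumptions, not anything from Paul's post.

  import json
  import urllib.parse
  import urllib.request

  params = urllib.parse.urlencode({
      "q": "title:solr",
      "group": "true",            # enable result grouping
      "group.field": "category",  # collapse on this single-valued field
      "group.limit": 2,           # documents kept per group
      "wt": "json",
  })
  url = "http://localhost:8983/solr/collection1/select?" + params
  with urllib.request.urlopen(url) as resp:
      groups = json.load(resp)["grouped"]["category"]["groups"]

  for g in groups:
      print(g["groupValue"], g["doclist"]["numFound"])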

BTW, note from “Known Limitations” in the Solr documentation:

Support for grouping on a multi-valued field has not yet been implemented.

Support for that would be really nice, since subjectIdentifier and subjectLocator have the potential to be sets of values.

Solr as an Analytics Platform

Filed under: Analytics,Search Engines,Solr — Patrick Durusau @ 4:19 pm

Solr as an Analytics Platform by Chris Becker.

From the post:

Here at Shutterstock we love digging into data. We collect large amounts of it, and want a simple, fast way to access it. One of the tools we use to do this is Apache Solr.

Most users of Solr will know it for its power as a full-text search engine. Its text analyzers, sorting, filtering, and faceting components provide an ample toolset for many search applications. A single instance can scale to hundreds of millions of documents (depending on your hardware), and it can scale even further through sharding. Modern web search applications also need to be fast, and Solr can deliver in this area as well.

The needs of a data analytics platform aren’t much different. It too requires a platform that can scale to support large volumes of data. It requires speed, and depends heavily on a system that can scale horizontally through sharding as well. And some of the main operations of data analytics – counting, slicing, and grouping — can be implemented using Solr’s filtering and faceting options.
(…)
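
A minimal sketch of that counting-and-slicing pattern, assuming a local Solr 4 instance and made-up field names ("country", "download_date"): filter to a slice of the data and facet on a field to get counts per value.

  import json
  import urllib.parse
  import urllib.request

  params = urllib.parse.urlencode({
      "q": "*:*",
      "fq": "download_date:[2013-08-01T00:00:00Z TO 2013-08-14T23:59:59Z]",
      "rows": 0,                  # only the counts, not the documents
      "facet": "true",
      "facet.field": "country",
      "facet.limit": 10,
      "wt": "json",
  })
  url = "http://localhost:8983/solr/collection1/select?" + params
  with urllib.request.urlopen(url) as resp:
      counts = json.load(resp)["facet_counts"]["facet_fields"]["country"]

  # Solr returns facet counts as a flat [value, count, value, count, ...] list.
  for value, count in zip(counts[::2], counts[1::2]):
      print(value, count)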

A good introduction to obtaining useful results with Solr with a minimum of effort.

Certainly a good way to show ROI when you are convincing your manager to sponsor you for a Solr conference and/or training.

Source Code Search Engines [DIY Drones]

Filed under: Open Source,Programming,Software — Patrick Durusau @ 4:04 pm

Open Source Matters: 6 Source Code Search Engines You Can Use For Programming Projects by Saikat Basu.

From the post:

The Open source movement is playing a remarkable role in pushing technology and making it available to all. The success of Linux is also an example of how open source can translate into a successful business model. Open source is pretty much mainstream now and in the coming years, it could have a major footprint across cutting edge educational technology and aerospace (think DIY drones).

Open source projects need all the help they can get. If not with funding, then with volunteers contributing to open source programming and free tools they can brandish. Search engines tuned with algorithms to find source code for programming projects are among the tools for the kit bag. While reusing code is a much debated topic in higher circles, they could be of help to beginner programmers and those trying to work their way through a coding logjam by cross-referencing their code. Here are six:

I don’t think any of these search engines will show up in comScore results. 😉

But they are search engines for a particular niche. And so free to optimize for their expected content, rather than trying to search everything. (Is there a lesson there?)

Which ones do you like best?

PS: On DIY drones, see: DIY DRONES – The Leading Community for Personal UAVs.

You may recall Abbie Hoffman saying in Steal this Book:

If you are around a military base, you will find it relatively easy to get your hands on an M-79 grenade launcher, which is like a giant shotgun and is probably the best self-defense weapon of all time. Just inquire discreetly among some long-haired soldiers.

Will DIY drones replace the M-79 grenade launcher as the “best” self-defense weapon?

Google Search Operators [Improving over Google]

Filed under: Search Behavior,Searching,Topic Maps — Patrick Durusau @ 3:46 pm

How To Make Good Use Of Google’s Search Operators by Craig Snyder.

From the post:

Some of you might not have the slightest clue what an operator is, in terms of using a search engine. Luckily enough, both Google and MakeUseOf offer some pretty good examples of how to use them with the world’s most popular search engine. In plain English, an operator is a tag that you can include within your Google search to make it more precise and specific.

With operators, you’re able to display results that pertain only to certain websites, search through a range of numbers, or even completely exclude a word from your results. When you master the use of Google’s search engine, finding the answer to nearly anything you can think of is a power that you have right at your fingertips. In this article, let’s make that happen.
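
For example, a few operator combinations (all standard Google syntax):

  site:makeuseof.com topic maps           results from one site only
  "topic maps" -rdf                       exact phrase, minus an unwanted word
  filetype:pdf eigenvector centrality     PDF documents only
  intitle:solr faceting                   the word must appear in the page title
  hadoop summit 2011..2013                a numeric range (years, in this case)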

8 Google Search Tips To Keep Handy At All Times by Dave Parrack.

From the post:

Google isn’t the only game in town when it comes to search. Alternatives such as Bing, DuckDuckGo, and Wolfram Alpha also provide the tools necessary to search the Web. However, the figures don’t lie, and the figures suggest that the majority of Internet users choose Google over the rest of the competition.

With that in mind it’s important to make sure all of those Google users are utilizing all that Google has to offer when it comes to its search engine. Everyone knows how to conduct a normal search by typing some words and/or a phrase into the box provided and following the links that emerge from the overcrowded fog. But Google Search offers a lot more than just the basics.

If friends or colleagues are using Google, I thought these posts might come in handy.

Speaking of the numbers, as of June 13, 2013, Google’s share of the search market was 66.7%. Bing was at 17.9%, and AOL, Inc., the smallest one listed, was at 1.3%. (What does that say to you about DuckDuckGo and Wolfram Alpha?)

Google’s majority share of the search market should be encouraging to anyone working on alternatives.

Why?

Google has left so much room for better search results.

For example, let’s say you find an article and you want to find other articles that rely on it. So you enter the title as a quoted phrase. What do you get back?

If it is a popular article, you may get hundreds of results. You and I both know you are not going to look at every article.

But a number of those articles are just citing the article of interest in a block of citations. Doesn’t have much to do with the results of the article at all.

But Google returns all of those, ranked for sure, but you don’t know enough about the ranking to decide whether two pages of search results are enough. Gold may be waiting on the third page. No way to tell.

Document level search results are just that. Document level search results. You can refine them for yourself but that’s not going to be captured by Google.

What is your example of improvement over the search results we get from Google now?

Wikidata RDF export available [And a tale of “part of.”]

Filed under: RDF,Wikidata — Patrick Durusau @ 3:04 pm

Wikidata RDF export available by Markus Krötzsch.

From the post:

I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file creation takes a few (about three) hours on my machine depending on what exactly is exported.

Wikidata (homepage)

WikiData:Database download.

I read an article earlier today about combining data released under different licenses. No problems here, because the data is released under the Creative Commons CC0 license. Watch out for content in other namespaces, though: different licensing may apply.

To run the Python script wda-export-data.py I had to install python-bitarray, so if you get an error message saying it is missing, that is the fix.
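
If you want a quick look at what the export contains, a few lines of rdflib will do. The sketch below just counts triples and tallies the most common predicates; the file name is whatever wda-export-data.py wrote out on your machine.

  from collections import Counter
  from rdflib import Graph

  g = Graph()
  # File name is an assumption; use whatever the script produced.
  g.parse("wikidata-statements.ttl", format="turtle")

  print(len(g), "triples")

  # Tally the most common predicates to see what vocabulary the export uses.
  predicates = Counter(p for _, p, _ in g)
  for predicate, count in predicates.most_common(10):
      print(count, predicate)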

Use the data with caution.

The entry for Wikipedia reports in part:

part of     List of Wikimedia projects

If you follow “part of” you will find:

this item is a part of that item

Also known as:

section of
system of
subsystem of
subassembly of
sub-system of
sub-assembly of
merged into
contained within
assembly of
within a set

“[P]art of” covers enough semantic range to return Google-like results (bad).

Not to mention that as a subject, I think “Wikipedia” is a bit more than an entry in a list.

Don’t you?

August 12, 2013

Greek New Testament (with syntax trees)

Filed under: Bible,Data — Patrick Durusau @ 4:07 pm

Greek New Testament (with syntax trees)

If you are tired of the same old practice data sets, I may have a treat for you!

The Asia Bible Society has produced syntax trees for the New Testament, using the SBL Greek New Testament text.

To give you an idea of the granularity of the data, the first sentence in Matthew is spread over forty-nine (49) lines of markup.

Not big data in the usual sense but important data.

How videos go viral on Twitter – Three stories

Filed under: Advertising,Marketing,Topic Maps — Patrick Durusau @ 3:35 pm

How videos go viral on Twitter – Three stories by Gordon MacMillan.

From the post:

What is it that makes videos go viral? It is one of the big questions in digital marketing. While there is no single magic formula, we’ve come up with some key insights after tracking the stories behind three recent viral videos.
(…)

  1. Twitter users love video
  2. Videos are easily shareable
  3. Promoted products amplify your reach
  4. Get creative with Vine

See Gordon’s post for the details. Although I warn you up front that there is no special sauce that makes a video go viral.

What would you show about topic maps in six seconds?

Microsoft as Hadoop Leader

Filed under: Hadoop,Microsoft,REEF — Patrick Durusau @ 3:03 pm

Microsoft to open source a big data framework called REEF by Derrick Harris.

From the post:

Microsoft has developed a big data framework called REEF (a graciously simple acronym for Retainable Evaluator Execution Framework) that the company intends to open source in about a month. REEF is designed to run on top of YARN, the next-generation resource manager for Hadoop, and is particularly well suited for building machine learning jobs.

Microsoft Technical Fellow and CTO of Information Services Raghu Ramakrishnan explained REEF and Microsoft’s plans to open source it during a Monday morning keynote at the International Conference for Knowledge Mining and Data Discovery, taking place in Chicago.

YARN is a resource manager developed as part of the Apache Hadoop project that lets users run and manage multiple types of jobs (e.g., batch MapReduce, stream processing with Storm and/or a graph-processing package) atop the same cluster of physical machines. This makes it possible not only to consolidate the number of systems that an organization has to manage, but also to run different types of analysis on top of the same data from the same place. In some cases, the entire data workflow can be carried out on just one cluster of machines.

This is very good news!

In part because it furthers the development of the Hadoop ecosystem.

But also because it reinforces the Microsoft commitment to the Hadoop ecosystem.

If you think of TCP/IP as a roadway, consider the value of goods and services moving along it.

Now think of the Hadoop ecosystem as another roadway.

An interoperable and high-speed roadway for data and data analysis.

Who has user facing applications that rely on data and data analysis? 😉

Here’s to hoping that MS doubles down on the Hadoop ecosystem!

Photographic Proof of a Subject?

Filed under: Graphics,Image Processing — Patrick Durusau @ 2:36 pm

Digital photography brought photo manipulation within the reach of anyone with a computer. Not to mention lots of free publicity for Adobe’s Photoshop, as in the term photoshopping.

New ways to detect photoshopping are being developed.

Abstract:

We describe a geometric technique to detect physically inconsistent arrangements of shadows in an image. This technique combines multiple constraints from cast and attached shadows to constrain the projected location of a point light source. The consistency of the shadows is posed as a linear programming problem. A feasible solution indicates that the collection of shadows is physically plausible, while a failure to find a solution provides evidence of photo tampering. (Eric Kee, James F. O’Brien, and Hany Farid. “Exposing Photo Manipulation with Inconsistent Shadows“. ACM Transactions on Graphics, 32(4):28:1–12, September 2013. Presented at SIGGRAPH 2013.)
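
The “posed as a linear programming problem” part is the interesting bit: each shadow constrains where the projected light source can be, and all you need to know is whether the constraints can be satisfied at once. Here is a toy feasibility check with scipy; the constraint numbers are made up, not the paper's geometry.

  import numpy as np
  from scipy.optimize import linprog

  # Each row is one shadow constraint a*x + b*y <= c on the projected light
  # position (x, y).  These numbers are invented; the paper derives them from
  # the geometry of cast and attached shadows in the image.
  A_ub = np.array([[ 1.0,  0.0],   # x <= 10
                   [-1.0,  0.0],   # x >= 2
                   [ 0.0,  1.0],   # y <= 8
                   [ 0.0, -1.0]])  # y >= 3
  b_ub = np.array([10.0, -2.0, 8.0, -3.0])

  # Zero objective: we only care whether any feasible point exists.
  result = linprog(c=[0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
  print("shadows consistent" if result.success else "evidence of tampering")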

If your experience has been with “photoshopped” images of political candidates and obvious “gag” photos, consider that photo manipulation has a darker side:

Recent advances in computational photography, computer vision, and computer graphics allow for the creation of visually compelling photographic fakes. The resulting undermining of trust in photographs impacts law enforcement, national security, the media, advertising, e-commerce, and more. The nascent field of photo forensics has emerged to help restore some trust in digital photographs [Farid 2009] (from the introduction)

Beyond simple provenance, it could be useful to establish, and associate with a photograph, analysis that supports its authenticity.

Exposing Photo Manipulation with Inconsistent Shadows. Webpage with extra resources.

Paper.

In case you had doubts, the technique is used by the authors to prove the Apollo lunar landing photo is not a fake.

PS: If images are now easy to use to misrepresent information, how much easier is it for textual data to be manipulated?

I’m thinking of those click-boxes (“yes, I agree to the terms of ….”) on most websites.

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

Filed under: Concept Detection,Indexing,Searching,Solr — Patrick Durusau @ 7:08 pm

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965”~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.
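
The query-rewrite step is simple enough to sketch. The snippet below (mine, not Sujit's Scala) maps recognized taxonomy phrases to their concept IDs, longest match first, using the two IDs quoted in the post:

  import re

  # Two concepts from the post; the real taxonomy is much larger.
  taxonomy = {
      "breast cancer": "2790981",
      "radiotherapy":  "2791965",
  }

  def rewrite(query):
      # Replace longer phrases first so "breast cancer" wins over "breast".
      for phrase in sorted(taxonomy, key=len, reverse=True):
          query = re.sub(r"\b%s\b" % re.escape(phrase), taxonomy[phrase], query)
      return query

  print(rewrite('"breast cancer radiotherapy"~5'))  # "2790981 2791965"~5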

So if Solr4 can make documents smarter, can the same be said about topics?

Recalling that “document” for Solr is defined by your indexing, not some arbitrary byte count.

As we are indexing topics we could add information to topics to make merging more robust.

One possible topic map flow being:

Index -> addToTopics -> Query -> Results -> Merge for Display.

Yes?

Freely Available Images

Filed under: Graphics — Patrick Durusau @ 6:56 pm

A brief guide to the best sites for finding freely available images online

From the post:

I’m currently running a 23 Things self-directed learning programme at my University. One of the Things we just covered is Creative Commons images, and the best places to find them. I have a whole bunch of useful sites I draw people’s attentions to in the Presentations Skills course I run, so shared them all via the 23 Things blog – it got a lot of RTs when I tweeted about it, so as people found it so useful I thought I’d share it here. Finding good quality images is absolutely critical to pretty much all forms of marketing, after all!

If you want to avoid Death by PowerPoint presentations or to spruce up your blog posts, images are a necessity.

But searching the web randomly for safe (legally speaking) images to use may take more than a little time and effort.

The resources listed here are good sources for freely available images.

Don’t depend on obscurity to avoid image permission problems. That could be really embarrassing with a former prospective client.

I first saw this in Christophe Lalanne’s A bag of tweets / July 2013.

The Linux Command Line (2nd Edition)

Filed under: Linux OS — Patrick Durusau @ 6:47 pm

The Linux Command Line (2nd Edition) by William Shotts.

From the webpage:

Designed for the new command line user, this 537-page volume covers the same material as LinuxCommand.org but in much greater detail. In addition to the basics of command line use and shell scripting, The Linux Command Line includes chapters on many common programs used on the command line, as well as more advanced topics.

Free PDF as well as print version from No Starch Press.

Download issues tonight, but from my memory of the first edition, this is a must-download volume.

I first saw this in Christophe Lalanne’s A bag of tweets / July 2013.

Exploring LinkedIn in Neo4j

Filed under: Graphs,Maps,Neo4j — Patrick Durusau @ 6:33 pm

Exploring LinkedIn in Neo4j by Rik Van Bruggen.

From the post:

Ever since I have been working for Neo, we have been trying to give our audience as many powerful examples of places where graph databases could really shine. And one of the obvious places has always been: social networks. That’s why I’ve written a post about facebook, and why many other graphistas have been looking at facebook and others to explain what could be done.

But while Facebook is probably the best-known social network, the one I use professionally the most is: LinkedIn. Some call it the creepiest network, but the fact of the matter is that professional network is, and has always been, a very useful way to get and stay in contact with other people from other organisations. And guess what: they do some fantastic stuff with their own, custom-developed graphs. One of these things is InMaps – a fantastic visualisation and colour coded analysis of your professional network. That’s where this blogpost got its inspiration from.

As Rik points out, you can view InMaps but you can’t do much else.

To fix that, Rik guides you through extracting data from InMaps and loading it into Neo4j.

For extra credit, try merging your data with data on the same people from other sources.

Could give you some insight into the problems faced by the NSA.
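
If you do try that merging exercise, Cypher’s MERGE (Neo4j 2.0 syntax) keeps you from creating duplicate nodes for people you already loaded. A rough sketch, assuming a CSV from the other source with “name” and “company” columns (my assumption), that emits statements you could pipe into neo4j-shell:

  import csv

  def esc(value):
      # Escape backslashes and double quotes for Cypher string literals.
      return value.replace("\\", "\\\\").replace('"', '\\"')

  with open("other_source_contacts.csv", newline="") as f:
      for row in csv.DictReader(f):  # assumed columns: name, company
          person, company = esc(row["name"]), esc(row["company"])
          print('MERGE (p:Person {name: "%s"}) '
                'MERGE (c:Company {name: "%s"}) '
                'MERGE (p)-[:WORKS_AT]->(c);' % (person, company))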

Purely Functional Photoshop [Functional Topic Maps?]

Filed under: Erlang,Functional Programming,Topic Maps — Patrick Durusau @ 10:33 am

Purely Functional Photoshop by James Hague.

From the post:

One of the first things you learn about Photoshop—or any similarly styled image editor—is to use layers for everything. Don’t modify existing images if you can help it. If you have a photo of a house and want to do some virtual landscaping, put each tree in its own layer. Want to add some text labels? More layers.

The reason is straightforward: you’re keeping your options open. You can change the image without overwriting pixels in a destructive way. If you need to save out a version of the image without labels, just hide that layer first. Maybe it’s better if the labels are slightly translucent? Don’t change the text; set the opacity of the layer.

This stuff about non-destructive operations sounds like something from a functional programming tutorial. It’s easy to imagine how all this layer manipulation could look behind the scenes. Here’s a list of layers using Erlang notation:

A great illustration of one aspect of functional programming using something quite familiar, Photoshop.
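
James gives the layer list in Erlang; the same idea in Python (my rendering, not his code) is just a list of layers where every “edit” returns a new list instead of mutating pixels:

  image = [
      {"name": "house",  "opacity": 1.0, "hidden": False},
      {"name": "tree",   "opacity": 1.0, "hidden": False},
      {"name": "labels", "opacity": 1.0, "hidden": False},
  ]

  def set_opacity(layers, name, opacity):
      # Returns a new list of layers; the original is untouched.
      return [dict(layer, opacity=opacity) if layer["name"] == name else layer
              for layer in layers]

  def hide(layers, name):
      return [dict(layer, hidden=True) if layer["name"] == name else layer
              for layer in layers]

  translucent_labels = set_opacity(image, "labels", 0.5)
  without_labels = hide(image, "labels")

  print(image[2])               # unchanged
  print(translucent_labels[2])  # opacity 0.5
  print(without_labels[2])      # hidden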

Imagine a set of topics and associations prior to any merging rules being applied. In one of the standard topic map syntaxes.

Wouldn’t applying merging rules as layers provide greater flexibility to explore what merging rules work best for a data set?

And wouldn’t opacity of topics and associations, to particular users, be a useful security measure?

Am I wrong in thinking the equivalent of layers would be a good next step for topic maps?

August 10, 2013

Chinese Agricultural Thesaurus published as Linked Open Data

Filed under: Agriculture,Linked Data — Patrick Durusau @ 2:56 pm

Chinese Agricultural Thesaurus published as Linked Open Data

From the post:

CAT is the largest agricultural domain thesaurus in China, which is held and maintained by AII of CAAS. CAT was the important fruit of more than 100 professionals’ six years hard work. The international and national standards were adopted while designing and constructing CAT. CAT covers areas including agriculture, forestry, biology, etc. It is organized in 40 main categories and contains more than 63 thousand concepts and most of them have English translation. In addition, CAT includes more than 130 thousand semantic relationships such as Use, UF, BT, NT and RT.

Not my favorite format but at least you can avoid a lot of tedious data entry.

Transformation and adding properties will take some effort but not as much as starting from scratch.
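
One way the transformation could start, assuming the Linked Open Data dump uses SKOS (the post does not say which vocabulary is used), is to pull out each concept’s English label and its broader-term (BT) links with rdflib:

  from rdflib import Graph
  from rdflib.namespace import SKOS

  g = Graph()
  g.parse("cat_thesaurus.rdf")  # file name and serialization are assumptions

  # Print each concept's broader-term (BT) link and its preferred label.
  for concept, broader in g.subject_objects(SKOS.broader):
      label = g.value(concept, SKOS.prefLabel)
      print(concept, "BT", broader, "|", label)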

NSA-proof your e-mail in 2 hours

Filed under: Cybersecurity,Security — Patrick Durusau @ 2:49 pm

NSA-proof your e-mail in 2 hours by Drew Crawford.

From the post:

You may be concerned that the NSA is reading your e-mail. Is there really anything you can do about it though? After all, you don’t really want to move off of GMail / Google Apps. And no place you would host is any better.

Except, you know, hosting it yourself. The way that e-mail was originally designed to work. We’ve all just forgotten because, you know, webapps-n-stuff. It’s a lot of work, mkay, and I’m a lazy software developer.

(…)

So bookmark this blog post, block off a Saturday next month, and get it done. Seriously. If you are still using GMail (or Yahoo, or arbitrary US-based email company) in August, your right to complain about the NSA spying on you is revoked. If you’re complaining about government spying on the Internet, or in a gathering of programmers, and you won’t take basic steps to do anything about it, then you’re a hypocrite, full-stop. I will personally come to your terminal and demand the return of your complain license. Pick a weekend, get it done. Or just admit that you’re okay with it. Either way, just be consistent.

If you don’t already encrypt your email at your client, these instructions may prove to be a bit much for you. 😉

You do know Email Rule #1?

Never put anything in email that you would not want read to a federal grand jury or published on the front page of the New York Times.

The NSA cannot intercept what you do not send.

I keep thinking there must be ways to use topic maps for very secure communications. The message assembles by a process of merging.

But if you lack the appropriate merging rules, its just a jumble of words.

August 9, 2013

Hoya (HBase on YARN) : Application Architecture

Filed under: Hadoop YARN,HBase — Patrick Durusau @ 6:51 pm

Hoya (HBase on YARN) : Application Architecture by Steve Loughran.

From the post:

At Hadoop Summit in June, we introduced a little project we’re working on: Hoya: HBase on YARN. Since then the code has been reworked and is now up on Github. It’s still very raw, and requires some local builds of bits of Hadoop and HBase – but it is there for the interested.

In this article we’re going to look at the architecture, and a bit of the implementation.

We’re not going to look at YARN in this article; for that we have a dedicated section of the Hortonworks site, including sample chapters of Arun Murthy’s forthcoming book. Instead we’re going to cover how Hoya makes use of YARN.

If you are interested in where Hadoop is likely to go beyond MapReduce and don’t mind getting your hands dirty, this is for you.

Introducing Watchtower…

Filed under: Hadoop,Pig — Patrick Durusau @ 6:43 pm

Introducing Watchtower – Like Light Table for Pig by Thomas Millar.

From the post:

There are no two ways around it, Hadoop development iterations are slow. Traditional programmers have always had the benefit of re-compiling their app, running it, and seeing the results within seconds. They have near instant validation that what they’re building is actually working. When you’re working with Hadoop, dealing with gigabytes of data, your development iteration time is more like hours.

Inspired by the amazing real-time feedback experience of Light Table, we’ve built Mortar Watchtower to bring back that almost instant iteration cycle developers are used to. Not only that, Watchtower also helps surface the semantics of your Pig scripts, to give you insight into how your scripts are working, not just that they are working.

Instant Feedback

Watchtower is a daemon that sits in the background, continuously flowing a sample of your data through your script while you work. It captures what your data looks like, and shows how it mutates at each step, directly inline with your script.

I am not sure about the “…helps surface the semantics of your Pig scripts…,” but just checking scripts against data is a real boon.

I continue to puzzle over how the semantics of data and operations in Pig scripts should be documented.

Old style C comments seem out of place in 21st century programming.

I first saw this at Alex Popescu’s Watchtower – Instant feedback development tool for Pig.

TinkerPop 2.4.0 Released (Gremlin Without a Cause)

Filed under: Blueprints,Frames,Gremlin,Pipes,Rexster,TinkerPop — Patrick Durusau @ 6:36 pm

TinkerPop 2.4.0 Released (Gremlin Without a Cause) by Marko A. Rodriguez.

From the post:

TinkerPop 2.4.0 has been released under the name “Gremlin without a Cause” (see attached logo). The last release was back in March of 2013, so there are lots of new features/bugfixes/optimizations in the latest 2.4.0 release. Here is the best-of-the-best of each project along with the full release notes.

NOTE: 2.4.0 jars have been deployed to Apache Central Repo and ready for inclusion.

Another offering for your summer holiday enjoyment!
