Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 19, 2015

DARPA: New and Updated Projects

Filed under: Cybersecurity,DARPA,Security — Patrick Durusau @ 1:38 pm

The DARPA Open Catalog lists two (2) updated projects and one (1) new project in addition to MEMEX.

New:

PLAN X: Plan X is a foundational cyberwarfare program to develop platforms for the Department of Defense to plan for, conduct, and assess cyberwarfare in a manner similar to kinetic warfare. Toward this end the program will bridge cyber communities of interest from academe, to the defense industrial base, to the commercial tech industry, to user-experience experts.

Plan X has three (3) sub-projects:

Mistral Compiler: Mistral is an experimental language and compiler for highly concurrent, distributed programming, with an emphasis on the specification and maintenance of a virtual overlay network and communication over that network. The current Mistral compiler supports the first generation of Mistral, so not all features we architected for the language are supported at present. This open source package includes our compiler and an interpreter. Use of Mistral for running programs on distributed systems requires a run-time system not included in this package. Thus this Mistral package allows only for experimentation with the language.

Lua Native Big Number Library: The PolarBN library is a big number library for use in cryptographic applications. PolarBN is a Lua wrapper for the bignum library in PolarSSL. The wrapper is written in C, using the standard Lua headers. Compared to the two standard Lua big number libraries, PolarBN is substantially faster than lbc, and does not require openssl-dev, as does lbn.

avro-golang-compiler: This repository contains a modification of the Avro Java compiler to generate golang code that uses the Avro C bindings to actually parse serialized Avro containers. (Java, C, Golang) (no link for this project)

Due to my lack of background in this area, I found parts of the Plan X project description, such as “…assess cyberwarfare in a manner similar to kinetic warfare,” rather opaque. Do they mean something like physical “war games”? Please clue me in if you can.

In trying to find an answer, I did read the Mistral documentation, such as it was, and ran across:

One challenge in programming at Internet scale is the development of languages in which to do this programming. For example, concurrency and control of it is one aspect where current languages fall short. A typical highly concurrent language such as Erlang can handle at most a few thousand concurrent processes in a computation, and requires substantial assumptions about reliable interconnection of all hosts involved in such computation. In contrast, languages for programming at Internet-scale should scale to handle millions of processes, yet be tolerant of highly dynamic network environments where both hosts and communication paths may come and go frequently during the lifetime of an application.

Any Erlangers care to comment?

Another source of puzzlement is how one would simulate a network with all its attendant vulnerabilities. In addition to varying versions and updates of software, there is the near-constant interaction of users with remote resources, email, etc. Your “defense” may be perfect except for when “lite” colonels fall for phishing email scams. Unless they intend to simulate user behavior as well. Just curious.
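To make the phishing point concrete, here is a back-of-the-envelope calculation (my numbers, not DARPA’s) showing why at least one susceptible user is a near certainty at organizational scale:

    # Illustrative arithmetic only: probability that at least one of n users
    # falls for a phishing email, assuming an independent per-user rate p.
    def p_at_least_one(p: float, n: int) -> float:
        return 1 - (1 - p) ** n

    for p in (0.01, 0.05):
        for n in (100, 1000):
            print(f"per-user rate {p:.2f}, {n} users: {p_at_least_one(p, n):.3f}")

Even at a 1% per-user rate, a 1,000-user network sees at least one success with probability above 0.9999, which is why simulating the network without simulating its users looks like only half the problem.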

Updated:

I say updated because DARPA says updated. I was unable to discover an easy way to tell which sub-parts were updated. I don’t have a screen shot of an earlier listing. But, for what it’s worth:

Active Authentication (AA): The Active Authentication (AA) program seeks to develop novel ways of validating the identity of computer users by focusing on the unique aspects of individuals through software-based biometrics. Biometrics are defined as the characteristics used to recognize individuals based on one or more intrinsic physical or behavioral traits. This program is focused on behavioral biometrics. [Seven (7) projects.]

XDATA: XDATA is developing an open source software library for big data to help overcome the challenges of effectively scaling to modern data volume and characteristics. The program is developing the tools and techniques to process and analyze large sets of imperfect, incomplete data. Its programs and publications focus on the areas of analytics, visualization, and infrastructure to efficiently fuse, analyze and disseminate these large volumes of data. [Eighty-three (83) projects so you can understand the difficulty in spotting the update.]

DARPA: MEMEX (Domain-Specific Search) Drops!

Filed under: DARPA,Search Engines,Searching,Tor — Patrick Durusau @ 12:49 pm

The DARPA MEMEX project is now listed on its Open Catalog page!

Forty (40) separate components are listed by team, project, category, link to code, description and license. Each column is sortable, of course.

No doubt DARPA has held back some of its best work, but looking over the descriptions, there are no boojums or quantum leaps beyond current discussions in search technology. How far you can push the released work beyond its current state is an exercise for the reader.

Machine learning is mentioned in the descriptions for DeepDive, Formasaurus and SourcePin. No explicit mention of deep learning, at least in the descriptions.

If you prefer not to visit the DARPA site, I have gleaned the essential information (project, link to code, description) into the following list:

  • ACHE: ACHE is a focused crawler. Users can customize the crawler to search for different topics or objects on the Web. (Java)
  • Aperture Tile-Based Visual Analytics: New tools for raw data characterization of ‘big data’ are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points. (JavaScript/Java)
  • ArrayFire: ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array-based function set makes parallel programming simple. ArrayFire’s multiple backends (CUDA, OpenCL, and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving users valuable time and lowering development costs. (C, C++, Python, Fortran, Java)
  • Autologin: AutoLogin is a utility that allows a web crawler to start from any given page of a website (for example the home page) and attempt to find the login page, where the spider can then log in with a set of valid, user-provided credentials to conduct a deep crawl of a site to which the user already has legitimate access. AutoLogin can be used as a library or as a service. (Python)
  • CubeTest: Official evaluation metric used for the TREC Dynamic Domain Track. It is a multi-dimensional metric that measures the effectiveness of completing a complex, task-based search process. (Perl)
  • Data Microscopes: Data Microscopes is a collection of robust, validated Bayesian nonparametric models for discovering structure in data. Models for tabular, relational, text, and time-series data can accommodate multiple data types, including categorical, real-valued, binary, and spatial data. Inference and visualization of results respects the underlying uncertainty in the data, allowing domain experts to feel confident in the quality of the answers they receive. (Python, C++)
  • DataWake: The Datawake project consists of various server and database technologies that aggregate user browsing data via a plug-in using domain-specific searches. This captured, or extracted, data is organized into browse paths and elements of interest. This information can be shared or expanded amongst teams of individuals. Elements of interest which are extracted either automatically, or manually by the user, are given weighted values. (Python/Java/Scala/Clojure/JavaScript)
  • DeepDive: DeepDive is a new type of knowledge base construction system that enables developers to analyze data on a deeper level than ever before. Many applications have been built using DeepDive to extract data from millions of documents, Web pages, PDFs, tables, and figures. DeepDive is a trained system, which means that it uses machine-learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive can deal with noisy and imprecise data by producing calibrated probabilities for every assertion it makes. DeepDive offers a scalable, high-performance learning engine. (SQL, Python, C++)
  • DIG: DIG is a visual analysis tool based on a faceted search engine that enables rapid, interactive exploration of large data sets. Users refine their queries by entering search terms or selecting values from lists of aggregated attributes. DIG can be quickly configured for a new domain through simple configuration. (JavaScript)
  • Dossier Stack: Dossier Stack provides a framework of library components for building active search applications that learn what users want by capturing their actions as truth data. The framework’s web services and JavaScript client libraries enable applications to efficiently capture user actions, such as organizing content into folders, and allow back-end algorithms to train classifiers and ranking algorithms to recommend content based on those user actions. (Python/JavaScript/Java)
  • Dumpling: Dumpling implements a novel dynamic search engine which refines search results on the fly. Dumpling utilizes the Winwin algorithm and the Query Change retrieval Model (QCM) to infer the user’s state and tailor search results accordingly. Dumpling provides a friendly user interface for users to compare the static and dynamic results. (Java, JavaScript, HTML, CSS)
  • FacetSpace: FacetSpace allows the investigation of large data sets based on the extraction and manipulation of relevant facets. These facets may be almost any consistent piece of information that can be extracted from the dataset: names, locations, prices, etc… (JavaScript)
  • Formasaurus: Formasaurus is a Python package that tells users the type of an HTML form: is it a login, search, registration, password recovery, join mailing list, contact form or something else. Under the hood it uses machine learning. (Python)
  • Frontera: Frontera (formerly Crawl Frontier) is used as part of a web crawler; it can store URLs and prioritize what to visit next. (Python)
  • HG Profiler: HG Profiler is a tool that allows users to take a list of entities from a particular source and look for those same entities across a pre-defined list of other sources. (Python)
  • Hidden Service Forum Spider: An interactive web forum analysis tool that operates over Tor hidden services. This tool is capable of passive forum data capture and posting dialog at random or user-specifiable intervals. (Python)
  • HSProbe (The Tor Hidden Service Prober): HSProbe is a Python multi-threaded, Stem-based application designed to interrogate the status of Tor hidden services (HSs) and extract hidden service content. It is an HS-protocol-savvy crawler that uses protocol error codes to decide what to do when a hidden service is not reached. HSProbe tests whether specified Tor hidden services (.onion addresses) are listening on one of a range of pre-specified ports, and optionally, whether they are speaking over other specified protocols. As of this version, support for HTTP and HTTPS is implemented. HSProbe takes as input a list of hidden services to be probed and generates as output a similar list of the results of each hidden service probed. (A minimal probing sketch appears after this list.) (Python)
  • ImageCat: ImageCat analyses images and extracts their EXIF metadata and any text contained in the image via OCR. It can handle millions of images. (Python, Java)
  • ImageSpace: ImageSpace provides the ability to analyze and search through large numbers of images. These images may be text searched based on associated metadata and OCR text or a new image may be uploaded as a foundation for a search. (Python)
  • Karma: Karma is an information integration tool that enables users to quickly and easily integrate data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information by modelling it according to an ontology of their choice using a graphical user interface that automates much of the process. (Java, JavaScript)
  • LegisGATE: Demonstration application for running General Architecture Text Engineering over legislative resources. (Java)
  • Memex Explorer: Memex Explorer is a pluggable framework for domain specific crawls, search, and unified interface for Memex Tools. It includes the capability to add links to other web-based apps (not just Memex) and the capability to start, stop, and analyze web crawls using 2 different crawlers – ACHE and Nutch. (Python)
  • MITIE: Trainable named entity extractor (NER) and relation extractor. (C)
  • Omakase: Omakase provides a simple and flexible interface to share data, computations, and visualizations between a variety of user roles in both local and cloud environments. (Python, Clojure)
  • pykafka: pykafka is a Python driver for the Apache Kafka messaging system. It enables Python programmers to publish data to Kafka topics and subscribe to existing Kafka topics. It includes a pure-Python implementation as well as an optional C driver for increased performance. It is the only Python driver to have feature parity with the official Scala driver, supporting both high-level and low-level APIs, including balanced consumer groups for high-scale uses. (A short usage sketch appears after this list.) (Python)
  • Scrapy Cluster: Scrapy Cluster is a scalable, distributed web crawling cluster based on Scrapy and coordinated via Kafka and Redis. It provides a framework for intelligent distributed throttling as well as the ability to conduct time-limited web crawls. (Python)
  • Scrapy-Dockerhub: Scrapy-Dockerhub is a deployment setup for Scrapy spiders that packages the spider and all dependencies into a Docker container, which is then managed by a Fabric command line utility. With this setup, users can run spiders seamlessly on any server, without the need for Scrapyd, which typically handles the spider management. With Scrapy-Dockerhub, users issue one command to deploy a spider with all its dependencies to the server and a second command to run it. There are also commands for viewing jobs, logs, etc. (Python)
  • Shadow: Shadow is an open-source network simulator/emulator hybrid that runs real applications like Tor and Bitcoin over a simulated Internet topology. It is light-weight, efficient, scalable, parallelized, controllable, deterministic, accurate, and modular. (C)
  • SMQTK: Kitware’s Social Multimedia Query Toolkit (SMQTK) is an open-source service for ingesting images and video from social media (e.g. YouTube, Twitter), computing content-based features, indexing the media based on the content descriptors, querying for similar content, and building user-defined searches via an interactive query refinement (IQR) process. (Python)
  • SourcePin: SourcePin is a tool to assist users in discovering websites that contain content they are interested in for a particular topic, or domain. Unlike a search engine, SourcePin allows a non-technical user to leverage the power of an advanced automated smart web crawling system to generate significantly more results than the manual process typically does, in significantly less time. The user interface of SourcePin allows users to quickly scan through hundreds or thousands of representative images to find the websites they are most interested in. SourcePin also has a scoring system which takes feedback from the user on which websites are interesting and, using machine learning, assigns a score to the other crawl results based on how interesting they are likely to be for the user. The roadmap for SourcePin includes integration with other tools and a capability for users to actually extract relevant information from the crawl results. (Python, JavaScript)
  • Splash: Lightweight, scriptable browser as a service with an HTTP API. (Python)
  • streamparse: streamparse runs Python code against real-time streams of data. It allows users to spin up small clusters of stream processing machines locally during development. It also allows remote management of stream processing clusters that are running Apache Storm. It includes a Python module implementing the Storm multi-lang protocol; a command-line tool for managing local development, projects, and clusters; and an API for writing data processing topologies easily. (Python, Clojure)
  • TellFinder: TellFinder provides efficient visual analytics to automatically characterize and organize publicly available Internet data. Compared to standard web search engines, TellFinder enables users to research case-related data in significantly less time. Reviewing TellFinder’s automatically characterized groups also allows users to understand temporal patterns, relationships and aggregate behavior. The techniques are applicable to various domains. (JavaScript, Java)
  • Text.jl: Text.jl provides numerous tools for text processing optimized for the Julia language. Supported functionality includes algorithms for feature extraction, text classification, and language identification. (Julia)
  • TJBatchExtractor: Regex-based information extractor for online advertisements. (Java)
  • Topic: This tool takes a set of text documents, filters by a given language, and then produces documents clustered by topic. The method used is Probabilistic Latent Semantic Analysis (PLSA). (Python)
  • Topic Space: Tool for visualizing topics in document collections. (Python)
  • Tor: The core software for using and participating in the Tor network. (C)
  • The Tor Path Simulator (TorPS): TorPS quickly simulates path selection in the Tor traffic-secure communications network. It is useful for experimental analysis of alternative route selection algorithms or changes to route selection parameters. (C++, Python, Bash)
  • TREC-DD Annotation: This annotation tool supports the annotation task in creating ground truth data for the TREC Dynamic Domain Track. It adopts a drag-and-drop approach for assessors to annotate passage-level relevance judgments. It also supports multiple ways of browsing and searching the various domains of corpora used in TREC DD. (Python, JavaScript, HTML, CSS)
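For a sense of what these components look like in use, here is the pykafka usage sketch promised above. It assumes a Kafka broker running locally on the default port (127.0.0.1:9092) and uses a made-up topic name; it is not part of the Memex release, just the library’s basic produce/consume pattern:

    from pykafka import KafkaClient

    # Assumes a local Kafka broker; "memex.test" is a made-up topic name.
    client = KafkaClient(hosts="127.0.0.1:9092")
    topic = client.topics[b"memex.test"]

    # Publish a message, then read it back.
    with topic.get_sync_producer() as producer:
        producer.produce(b"hello from pykafka")

    consumer = topic.get_simple_consumer(consumer_timeout_ms=5000)
    for message in consumer:
        if message is not None:
            print(message.offset, message.value)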

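Along the same lines, the kind of check HSProbe performs can be approximated with ordinary tooling. This is not HSProbe itself, just the minimal probing sketch mentioned in its entry: it asks whether a .onion address answers over HTTP, assuming a local Tor client on its default SOCKS port (9050) and requests installed with SOCKS support:

    import requests

    # Route requests through a local Tor client (default SOCKS port 9050).
    # Requires: pip install "requests[socks]" and a running Tor instance.
    TOR_PROXIES = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def probe_onion(url, timeout=60):
        """Return the HTTP status code if the hidden service answers, else None."""
        try:
            return requests.get(url, proxies=TOR_PROXIES, timeout=timeout).status_code
        except requests.RequestException:
            return None

    # The address below is a placeholder, not a real hidden service.
    print(probe_onion("http://exampleonionaddress.onion/"))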
Beyond whatever use you find for the software, it is also important in terms of what capabilities are of interest to DARPA and, by extension, to those interested in militarized IT.

April 15, 2015

Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

Filed under: DARPA,Search Analytics,Search Engines,Search Requirements,Uncategorized — Patrick Durusau @ 2:20 pm

Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

Is DARPA’s Memex search engine a Google-killer? by Mark Stockley

A couple of “while you wait” pieces to read while waiting for part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

Fox-Brewster has video of a part of the system. As he describes it:

It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.
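Splash, one of the components listed above, is aimed at exactly this gap: it runs the page in a real browser engine so that JavaScript-generated content is present in the HTML the crawler sees. A minimal sketch, assuming a Splash instance on its default port (8050, e.g. started from the scrapinghub/splash Docker image):

    import requests

    # Ask Splash to load the page, wait for scripts to run, and return the rendered HTML.
    resp = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "http://example.com", "wait": 2.0},
        timeout=90,
    )
    rendered_html = resp.text  # includes content a plain HTTP fetch would miss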

If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

Entirely manual indexing is already impossible, and you have experienced the shortcomings of page ranking.

Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

Imagine an electronic The Art of Computer Programming with several human experts per chapter, assisted by deep search, updating the references and the text on an ongoing basis, so readers don’t have to weed through all the re-inventions of particular algorithms across numerous computer science and math journals.

Or perhaps a more automated search of news reports, so the earliest/most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value; more often it just adds content.
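A crude sketch of how that notation could be computed, with made-up snippets (illustrative only, not a production design): compare word overlap between reports and flag later ones that add little.

    import re

    def word_set(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    earliest = "Officials confirmed the outage began at 6 a.m. and affected three states."
    later = "The outage, which officials confirmed began at 6 a.m., affected three states."

    # Values near 1.0 suggest a retelling rather than new reporting; a real system
    # would use shingling/MinHash to make this comparison scale.
    print(round(jaccard(word_set(earliest), word_set(later)), 2))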

BTW, the focus on the capabilities of the search engine, as opposed to the analysis of those results, is most welcome.

See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

Looking forward to Friday or Monday!

February 11, 2015

Darpa Is Developing a Search Engine for the Dark Web

Filed under: Dark Data,DARPA,Search Engines — Patrick Durusau @ 7:19 pm

Darpa Is Developing a Search Engine for the Dark Web by Kim Zetter.

From the post:

A new search engine being developed by Darpa aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity.

The project, dubbed Memex, has been in the works for a year and is being developed by 17 different contractor teams who are working with the military’s Defense Advanced Research Projects Agency. Google and Bing, with search results influenced by popularity and ranking, are only able to capture approximately five percent of the internet. The goal of Memex is to build a better map of more internet content.

Reading how Darpa is building yet another bigger dirt pile, I was reminded of Rick Searle saying:


Rather than think of Big Data as somehow providing us with a picture of reality, “naturally emerging” as Mayer-Schönberger quoted above suggested we should start to view it as a way to easily and cheaply give us a metric for the potential validity of a hypothesis. And it’s not only the first step that continues to be guided by old fashioned science rather than computer driven numerology but the remaining steps as well, a positive signal followed up by actual scientist and other researchers doing such now rusting skills as actual experiments and building theories to explain their results. Big Data, if done right, won’t end up making science a form of information promising, but will instead be used as the primary tool for keeping scientist from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old school human intelligence even if it now needs to be empowered by new digital tools. Rather than Big Data being used to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted and based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence and towards more targeted surveillance rather than the mass data grab symbolized by Bluffdale may be a reality forced on the NSA et al by events. In part due to the Snowden revelations terrorist and criminal networks have already abandoned the non-secure public networks which the rest of us use. Mass surveillance has lost its raison d’etre.

In particular because the project is designed to automatically discover “relationships:”


But the creators of Memex don’t want just to index content on previously undiscovered sites. They also want to use automated methods to analyze that content in order to uncover hidden relationships that would be useful to law enforcement, the military, and even the private sector. The Memex project currently has eight partners involved in testing and deploying prototypes. White won’t say who the partners are but they plan to test the system around various subject areas or domains. The first domain they targeted were sites that appear to be involved in human trafficking. But the same technique could be applied to tracking Ebola outbreaks or “any domain where there is a flood of online content, where you’re not going to get it if you do queries one at a time and one link at a time,” he says.

I for one am very sure the new system (I refuse to sully the historical name by associating it with this doomed DARPA project) will find relationships, many relationships in fact. Too many relationships for any organization, no matter how large, to sort the relevant from the irrelevant.

If you want to segment the task, you could say that data mining is charged with finding relationships.

However, the next step, data analysis, is to evaluate the evidence for the relationships found in the preceding step.

The step after evaluating the evidence for relationships discovered is to determine what, if anything, those relationships mean to some question at hand.

In all but the simplest of cases, there will be even more steps than the ones I listed, all of which must occur before you have extracted reliable intelligence from the data mining exercise.
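A toy sketch of that distinction, with made-up documents and entities (not DARPA’s method): step one counts co-occurrences, which yields a “relationship” for every pair; step two asks whether the association is stronger than chance before anyone treats it as evidence.

    import math
    from collections import Counter
    from itertools import combinations

    # Made-up documents, each reduced to the entities mentioned in it.
    docs = [
        {"alice", "acme", "shipping"},
        {"alice", "acme", "invoices"},
        {"bob", "acme"},
        {"carol", "shipping"},
    ]

    n = len(docs)
    entity_counts = Counter(e for d in docs for e in d)
    pair_counts = Counter(p for d in docs for p in combinations(sorted(d), 2))

    def pmi(pair):
        """Pointwise mutual information: > 0 means the pair co-occurs more than chance."""
        a, b = pair
        return math.log2((pair_counts[pair] / n) /
                         ((entity_counts[a] / n) * (entity_counts[b] / n)))

    # Step 1 finds every co-occurring pair; step 2 scores the evidence for each.
    for pair, count in sorted(pair_counts.items()):
        print(pair, "co-occurrences:", count, "PMI:", round(pmi(pair), 2))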

Having data to search is a first step. Searching for and finding relationships in data is another step. But if that is where the search trail ends, you have just wasted another $10 to $20 million that could have gone for worthwhile intelligence gathering.

March 23, 2014

XDATA – DARPA

Filed under: DARPA,Government Data,Open Source — Tags: — Patrick Durusau @ 7:00 pm

XDATA – DARPA

From the about page:

The DARPA Open Catalog is a list of DARPA-sponsored open source software products and related publications. Each resource link shown on this site links back to information about each project, including links to the code repository and software license information.

This site reorganizes the resources of the Open Catalog (specifically the XDATA program) in a way that is easily sortable based on language, project or team. More information about XDATA’s open source software toolkits and peer-reviewed publications can be found on the DARPA Open Catalog, located at http://www.darpa.mil/OpenCatalog/.

For more information about this site, e-mail us at piim@newschool.edu.

A great public service for anyone interested in DARPA XDATA projects.

You could view this as encouragement to donate time to government hackathons.

I disagree.

Donating services to an organization that pays for IT and then accepts crap results encourages poor IT management.
