Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 22, 2015

MINIX 3

Filed under: Cybersecurity,Programming,Security — Patrick Durusau @ 3:29 pm

MINIX 3

From the read more page:

MINIX 3 is a free open-source operating system that can be used for studying operating systems, as a base for research projects, or for commercial (embedded) systems where microkernel systems dominate the market. Much of the focus on the project is on achieving high reliability through fault tolerance and self-healing techniques.

MINIX is based on a small (about 12K lines of code) microkernel that runs in kernel mode. The rest of the operating system runs as a collection of server processes, each one protected by the hardware MMU. These processes include the virtual file system, one or more actual file systems, the memory manager, the process manager, the reincarnation server, and the device drivers, each one running as a separate user-mode process.

One consequence of this design is that failures of the system due to bugs or attacks are isolated. For example, a failure or takeover of the audio driver due to a bug or exploit can lead to strange sounds but cannot lead to a full takeover of the operating system. Similarly, crashes of a system component can in many cases be automatically and transparently recovered without human intervention. Few, if any, other operating systems are as self-healing as MINIX 3.
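To make the "reincarnation server" idea concrete, here is a toy sketch of a supervision loop, my own illustration in Python, not anything from the MINIX code base: a parent process that notices a (hypothetical) driver process dying and restarts it.

```python
import subprocess
import time

# Toy analogue of a "reincarnation server": supervise a hypothetical
# driver process and restart it whenever it exits abnormally.
# The command below is an assumption for illustration only.
DRIVER_CMD = ["python3", "audio_driver.py"]  # hypothetical driver script

def supervise(cmd, restart_delay=1.0):
    while True:
        proc = subprocess.Popen(cmd)
        code = proc.wait()            # block until the "driver" exits
        if code == 0:
            break                     # clean shutdown, stop supervising
        # Crash or kill: report and "reincarnate" the service.
        print(f"driver exited with {code}, restarting in {restart_delay}s")
        time.sleep(restart_delay)

if __name__ == "__main__":
    supervise(DRIVER_CMD)
```

MINIX 3 does this at the operating system level, with hardware-enforced isolation between the restarted pieces, which is what makes the self-healing claim interesting.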

Are you still using an insecure and bug-riddled OS?

Your choice.

Authorea

Filed under: Collaboration,Collaborative Annotation,Writing — Patrick Durusau @ 2:43 pm

Authorea is the collaborative typewriter for academia.

From the website:

Write on the web.
Writing a scientific article should be as easy as writing a blog post. Every document you create becomes a beautiful webpage, which you can share.
Collaborate.
More and more often, we write together. A recent paper coauthored on Authorea by a CERN collaboration counts over 200 authors. If we can solve collaboration for CERN, we can solve it for you too!
Version control.
Authorea uses Git, a robust versioning control system to keep track of document changes. Every edit you and your colleagues make is recorded and can be undone at any time.
Use many formats.
Authorea lets you write in LaTeX, Markdown, HTML, Javascript, and more. Different coauthors, different formats, same document.
Data-rich science.
Did you ever wish you could share with your readers the data behind a figure? Authorea documents can take data alongside text and images, such as IPython notebooks and d3.js plots to make your articles shine with beautiful data-driven interactive visualizations.

Sounds good? Sign up or log in to get started immediately, for free.

Authorea uses a gentle form of open source persuasion. You can have one (1) private article for free but unlimited public articles. As your monthly rate goes up, you can have an increased number of private articles. Works for me because most if not all of my writing/editing is destined to be public anyway.

Standards are most useful when they are writ LARGE so private or “secret” standards have never made sense to me.

Sharif University

Filed under: Cybersecurity,Education,Politics — Patrick Durusau @ 1:47 pm

Treadstone 71 continues to act as an unpaid (so far as I know) advertising agent for Sharif University.

From the university homepage:

Sharif University of Technology is one of the largest engineering schools in the Islamic Republic of Iran. It was established in 1966 under the name of Aryarmehr University of Technology and, at that time, there were 54 faculty members and a total of 412 students who were selected by national examination. In 1980, the university was renamed Sharif University of Technology. SUT now has a total of 300 full-time faculty members, approximately 430 part-time faculty members and a student body of about 12,000.

Treadstone 71 has published course notes from an Advanced Network Security course on honeypots.

In part Treadstone 71 comments:

There are many documents available on honeypot detection. Not too many are found as a Master’s course at University levels. Sharif University as part of the Iranian institutionalized efforts to build a cyber warfare capability for the government in conjunction with AmnPardaz, Ashiyane, and shadowy groups such as Ajax and the Iranian Cyber Army is highly focused on such an endeavor. With funding coming from the IRGC, infiltration of classes and as members of academia with Basij members, Sharif University is the main driver of information security and cyber operations in Iran. Below is another of many such examples. Honeypots and how to detect them is available for your review.

It is difficult to find a Master’s degree in CS that doesn’t include coursework on network security in general and honeypots in particular. I spot checked some of the degrees offered by schools listed at: Best Online Master’s Degrees in Computer Science and found no shortage of information on honeypots.

I recognize the domestic (U.S.) political hysteria surrounding Iran but security decisions based on rumor and unfounded fears aren’t the best ones.

Project Naptha

Filed under: Image Processing,OCR — Patrick Durusau @ 10:18 am

Project Naptha – highlight, copy, and translate text from any image by Kevin Kwok.

From the webpage:

Project Naptha automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.

The homepage has examples of Project Naptha being used on comics, scans, photos, diagrams, Internet memes, screenshots, along with sneak peeks at beta features, such as translation, erase text (from images) and change text. (You can select multiple regions with the shift key.)

This should be especially useful for journalists, bloggers, researchers, basically anyone who spends a lot of time looking for content on the Web.

If the project needs a slogan, I would suggest:

Naptha Frees Information From Image Prisons!

May 21, 2015

FBI Director Comey As Uninformed or Not Fair-Minded

Filed under: Cybersecurity,Security — Patrick Durusau @ 8:12 pm

Transcript: Comey Says Authors of Encryption Letter Are Uninformed or Not Fair-Minded by Megan Graham.

From the post:

Earlier today, FBI Director James Comey implied that a broad coalition of technology companies, trade associations, civil society groups, and security experts were either uninformed or were not “fair-minded” in a letter they sent to the President yesterday urging him to reject any legislative proposals that would undermine the adoption of strong encryption by US companies. The letter was signed by dozens of organizations and companies in the latest part of the debate over whether the government should be given built-in access to encrypted data (see, for example, here, here, here, and here for previous iterations).

The comments were made at the Third Annual Cybersecurity Law Institute held at Georgetown University Law Center. The transcript of his encryption-related discussion is below (emphasis added).

A group of tech companies and some prominent folks wrote a letter to the President yesterday that I frankly found depressing. Because their letter contains no acknowledgment that there are societal costs to universal encryption. Look, I recognize the challenges facing our tech companies. Competitive challenges, regulatory challenges overseas, all kinds of challenges. I recognize the benefits of encryption, but I think fair-minded people also have to recognize the costs associated with that. And I read this letter and I think, “Either these folks don’t see what I see or they’re not fair-minded.” And either one of those things is depressing to me. So I’ve just got to continue to have the conversation.

Governments have a long history of abusing citizens and data entrusted to them.

Director Comey is very uninformed if he is unaware of the role that Hollerith machines and census data played in the Holocaust.

Holland embraced Hollerith machines for its 1930’s census, with the same good intentions as FBI Director Comey for access to encrypted data.

France never made effective use of Hollerith machines prior to or during WWII.

What difference did that make?

Country    Jewish Population      Deported    Murdered    Death Ratio
Holland    140,000                107,000     102,000     73%
France     300,000 to 350,000     85,000      82,000      25%

(IBM and the Holocaust : the strategic alliance between Nazi Germany and America’s most powerful corporation by Edwin Black, at page 332.)

A fifty (50%) percent difference in the effectiveness of government oppression due to technology sounds like a lot to me. Bearing in mind that Comey wants to make government collection of data even more widespread and efficient.

If invoking the Holocaust seems like a reflex action, consider the Hollywood blacklist, the McCarthy era, government infiltration of the peace movement of the 1960s, oppression of the Black Panther Party, long term discrimination against homosexuals, ongoing discrimination based on race, gender, religion, ethnicity, to name just a few instances of where government intrusion has been facilitated by its data gathering capabilities.

If anyone is being “not fair-minded” in the debate over encryption, it’s FBI Director Comey. The pages of history, past, recent and contemporary, bleed from the gathering and use of data by government at different levels.

Making a huge leap of faith and saying the current government is in good faith, in no way guarantees that a future government will be in good faith.

Strong encryption won’t save us from a corrupt government, but it may give us a fighting chance of a better government.

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

Filed under: Humor,Machine Learning,Music — Patrick Durusau @ 2:01 pm

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

From the post:

The ancient skill of creating and performing spoken rhyme is thriving today because of the inexorable rise in the popularity of rapping. This art form is distinct from ordinary spoken poetry because it is performed to a beat, often with background music.

And the performers have excelled. Adam Bradley, a professor of English at the University of Colorado has described it in glowing terms. Rapping, he says, crafts “intricate structures of sound and rhyme, creating some of the most scrupulously formal poetry composed today.”

The highly structured nature of rap makes it particularly amenable to computer analysis. And that raises an interesting question: if computers can analyze rap lyrics, can they also generate them?

Today, we get an affirmative answer thanks to the work of Eric Malmi at the University of Aalto in Finland and a few pals. These guys have trained a machine-learning algorithm to recognize the salient features of a few lines of rap and then choose another line that rhymes in the same way on the same topic. The result is an algorithm that produces rap lyrics that rival human-generated ones for their complexity of rhyme.

The review is a fun read but I rather like the original paper title as well: DopeLearning: A Computational Approach to Rap Lyrics Generation by Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, Aristides Gionis.

Abstract:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow. We present a method for capturing both of these aspects. Our approach is based on two machine-learning techniques: the RankSVM algorithm, and a deep neural network model with a novel structure. For the problem of distinguishing the real next line from a randomly selected one, we achieve an 82 % accuracy. We employ the resulting prediction method for creating new rap lyrics by combining lines from existing songs. In terms of quantitative rhyme density, the produced lyrics outperform best human rappers by 21 %. The results highlight the benefit of our rhyme density metric and our innovative predictor of next lines.

You should also visit BattleBot (a rap engine):

BattleBot is a rap engine which allows you to “spit” any line that comes to your mind after which it will respond to you with a selection of rhyming lines found among 0.5 million lines from existing rap songs. The engine is based on a slightly improved version of the Raplyzer algorithm and the eSpeak speech synthesizer.

You can try out BattleBot simply by hitting “Spit” or “Random”. The latter will randomly pick a line among the whole database of lines and find the rhyming lines for that. The underlined part shows approximately the rhyming part of a result. To understand better, why it’s considered as a rhyme, you can click on the result, see the phonetic transcriptions of your line and the result, and look for matching vowel sequences starting from the end.
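That "matching vowel sequences starting from the end" mechanic is easy to sketch. What follows is my own simplified illustration, not the Raplyzer algorithm: it reduces two lines to their vowel "skeletons" and counts how long a common vowel suffix they share, a rough proxy for rhyme length. Real systems work on phonetic transcriptions (e.g. from eSpeak), not raw spelling.

```python
# Simplified rhyme-length sketch (not the Raplyzer algorithm itself):
# compare the vowel skeletons of two lines from the end and count how
# many vowels match. Treat this as a toy proxy; the real work uses
# phonetic transcriptions rather than spelling.
VOWELS = set("aeiou")

def vowel_skeleton(line: str) -> str:
    return "".join(ch for ch in line.lower() if ch in VOWELS)

def rhyme_length(line_a: str, line_b: str) -> int:
    a, b = vowel_skeleton(line_a), vowel_skeleton(line_b)
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

if __name__ == "__main__":
    # Shares the trailing vowel pattern of "dreams"/"beams", so prints 2.
    print(rhyme_length("pockets full of dreams",
                       "rockets to the beams"))
```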

BTW, the MIT review concludes with:

What’s more, this and other raps generated by DeepBeat have a rhyming density significantly higher than any human rapper. “DeepBeat outperforms the top human rappers by 21% in terms of length and frequency of the rhymes in the produced lyrics,” they point out.

I can’t help but wonder when DeepBeat is going to hit the charts! 😉

Top 10 data mining algorithms in plain English

Filed under: Algorithms,Data Mining — Patrick Durusau @ 1:20 pm

Top 10 data mining algorithms in plain English by Raymond Li.

From the post:

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Raymond covers:

  1. C4.5
  2. k-means
  3. Support vector machines
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART
  11. Interesting Resources
  12. Now it’s your turn…

Would be nice if we all had a similar ability to explain algorithms!
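If you want to pair the plain-English explanations with running code, here is a minimal example of one of the ten, k-means, using scikit-learn. The data is synthetic and purely for illustration.

```python
# Minimal k-means illustration with scikit-learn on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Two obvious blobs of 2-D points, purely for demonstration.
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```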

Enjoy!

Twitter As Investment Tool

Filed under: Social Media,Social Networks,Social Sciences,Twitter — Patrick Durusau @ 12:44 pm

Social Media, Financial Algorithms and the Hack Crash by Tero Karppi and Kate Crawford.

Abstract:

@AP: Breaking: Two Explosions in the White House and Barack Obama is injured’. So read a tweet sent from a hacked Associated Press Twitter account @AP, which affected financial markets, wiping out $136.5 billion of the Standard & Poor’s 500 Index’s value. While the speed of the Associated Press hack crash event and the proprietary nature of the algorithms involved make it difficult to make causal claims about the relationship between social media and trading algorithms, we argue that it helps us to critically examine the volatile connections between social media, financial markets, and third parties offering human and algorithmic analysis. By analyzing the commentaries of this event, we highlight two particular currents: one formed by computational processes that mine and analyze Twitter data, and the other being financial algorithms that make automated trades and steer the stock market. We build on sociology of finance together with media theory and focus on the work of Christian Marazzi, Gabriel Tarde and Tony Sampson to analyze the relationship between social media and financial markets. We argue that Twitter and social media are becoming more powerful forces, not just because they connect people or generate new modes of participation, but because they are connecting human communicative spaces to automated computational spaces in ways that are affectively contagious and highly volatile.

The social sciences lag behind the computer sciences in making their publications publicly accessible, still publishing behind paywalls, so all I can report on is the abstract.

On the other hand, I’m not sure how much practical advice you could gain from the article as opposed to the volumes of commentary following the incident itself.

The research reminds me of Malcolm Gladwell, author of The Tipping Point and similar works.

While I have greatly enjoyed several of Gladwell’s books, including the Tipping Point, it is one thing to look back and say: “Look, there was a tipping point.” It is quite another to be in the present and successfully say: “Look, there is a tipping point and we can make it tip this way or that.”

In retrospect, we all credit ourselves with near omniscience when our plans succeed and we invent fanciful explanations about what we knew or realized at the time. Others, equally skilled, dedicated and competent, who started at the same time, did not succeed. Of course, the conservative media (and ourselves if we are honest), invent narratives to explain those outcomes as well.

Of course, deliberate manipulation of the market with false information, via Twitter or not, is illegal. The best you can do is look for a pattern of news and/or tweets that results in a downward change in a particular stock, which then recovers, and apply that pattern more broadly. You won’t make $millions off of any one transaction but that is the sort of thing that draws regulatory attention.

LogJam – Postel’s Law In Action

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:31 am

The seriousness of the LogJam vulnerability was highlighted by John Leyden in Average enterprise ‘using 71 services vulnerable to LogJam’


Based on analysis of 10,000 cloud applications and data from more than 17 million global cloud users, cloud visibility firm Skyhigh Networks reckons that 575 cloud services are potentially vulnerable to man-in-the middle attacks. The average company uses 71 potentially vulnerable cloud services.

[Details from Skyhigh Networks]

The historical source of LogJam?


James Maude, security engineer at Avecto, said that the LogJam flaw shows how internet regulations and architecture decisions made more than 20 years ago are continuing to throw up problems.

“The LogJam issue highlights how far back the long tail of security stretches,” Maude commented. “As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies. This effectively undermines the security model you are trying to build. Several recent vulnerabilities such as POODLE and FREAK have harnessed this type of weakness, tricking clients into using old, less secure forms of encryption,” he added.

Graham Cluley in Logjam vulnerability – what you need to know has better coverage of the history of weak encryption that resulted in the LogJam vulnerability.

What does that have to do with Postel’s Law?

TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others. [RFC761]

As James Maude noted earlier:

As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies.

Probably not what Postel intended at the time, but it is certainly more “robust” in one sense of the word: technologies remain compatible with other technologies that still use vulnerable encryption.

In other words, robustness is responsible for the maintenance of weak encryption and hence the current danger from LogJam.

This isn’t an entirely new idea. Eric Allman (Sendmail), warns of security issues with Postel’s Law in The Robustness Principle Reconsidered: Seeking a middle ground:

In 1981, Jon Postel formulated the Robustness Principle, also known as Postel’s Law, as a fundamental implementation guideline for the then-new TCP. The intent of the Robustness Principle was to maximize interoperability between network service implementations, particularly in the face of ambiguous or incomplete specifications. If every implementation of some service that generates some piece of protocol did so using the most conservative interpretation of the specification and every implementation that accepted that piece of protocol interpreted it using the most generous interpretation, then the chance that the two services would be able to talk with each other would be maximized. Experience with the Arpanet had shown that getting independently developed implementations to interoperate was difficult, and since the Internet was expected to be much larger than the Arpanet, the old ad-hoc methods needed to be enhanced.

Although the Robustness Principle was specifically described for implementations of TCP, it was quickly accepted as a good proposition for implementing network protocols in general. Some have applied it to the design of APIs and even programming language design. It’s simple, easy to understand, and intuitively obvious. But is it correct.

For many years the Robustness Principle was accepted dogma, failing more when it was ignored rather than when practiced. In recent years, however, that principle has been challenged. This isn’t because implementers have gotten more stupid, but rather because the world has become more hostile. Two general problem areas are impacted by the Robustness Principle: orderly interoperability and security.

Eric doesn’t come to a definitive conclusion with regard to Postel’s Law but the general case is always difficult to decide.

However, the specific case, whether to keep supporting encryption known to be vulnerable, shouldn’t be.

If there were a programming principles liability checklist, one of the tick boxes should read:

___ Supports (list of encryption schemes), Date:_________

Lawyers doing discovery can compare lists of known vulnerabilities as of the date given for liability purposes.

Programmers would be on notice that supporting encryption with known vulnerabilities is opening the door to legal liability.
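Checking such a claim mechanically is not hard in principle. Here is a rough sketch of my own (not from either article) of probing whether a TLS server will still negotiate named cipher suites, using Python’s ssl module. The host and the OpenSSL cipher names are assumptions for illustration, and a modern OpenSSL build may refuse to offer export-grade suites at all, in which case the probe simply reports them as unavailable.

```python
import socket
import ssl

HOST, PORT = "example.com", 443          # assumed target, illustration only
SUSPECT_CIPHERS = [
    "EXP-DHE-RSA-DES-CBC-SHA",           # export-grade DHE (LogJam class)
    "DHE-RSA-DES-CBC3-SHA",              # legacy DHE with 3DES
]

def server_accepts(cipher: str) -> bool:
    """Attempt a handshake offering only `cipher`; True if it succeeds."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        ctx.set_ciphers(cipher)          # may raise if OpenSSL lacks the suite
    except ssl.SSLError:
        return False
    try:
        with socket.create_connection((HOST, PORT), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST):
                return True
    except (ssl.SSLError, OSError):
        return False

for c in SUSPECT_CIPHERS:
    print(c, "accepted" if server_accepts(c) else "rejected/unavailable")
```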

May 20, 2015

Format String Bug Exploration

Filed under: Cybersecurity,Programming,Security — Patrick Durusau @ 7:14 pm

Format String Bug Exploration by AJ Kumar.

From the post:

Abstract

The Format String vulnerability significantly introduced in year 2000 when remote hackers gain root access on host running FTP daemon which had anonymous authentication mechanism. This was an entirely new tactics of exploitation the common programming glitches behind the software, and now this deadly threat for the software is everywhere because programmers inadvertently used to make coding loopholes which are targeting none other than Format string attack. The format string vulnerability is an implication of misinterpreting the stack for handling functions with variable arguments especially in Printf function, since this article demonstrates this subtle bug in C programming context on windows operating system. Although, this class of bug is not operating system–specific as with buffer overflow attacks, you can detect vulnerable programs for Mac OS, Linux, and BSD. This article drafted to delve deeper at what format strings are, how they are operate relative to the stack, as well as how they are manipulated in the perspective of C programming language.

Essentials

To be cognizance with the format string bug explained in this article, you will require to having rudimentary knowledge of the C family of programming languages, as well as a basic knowledge of IA32 assembly over window operating system, by mean of visual studio development editor. Moreover, know-how about ‘buffer overflow’ exploitation will definitely add an advantage.

Format String Bug

The format string bug was first explained in June 2000 in a renowned journal. This notorious exploitation tactics enable a hacker to subvert memory stack protections and allow altering arbitrary memory segments by unsolicited writing over there. Overall, the sole cause behind happening is not to handle or properly validated the user-supplied input. Just blindly trusting the used supplied arguments that eventually lead to disaster. Subsequently, when hacker controls arguments of the Printf function, the details in the variable argument lists enable him to analysis or overwrite arbitrary data. The format string bug is unlike buffer overrun; in which no memory stack is being damaged, as well as any data are being corrupted at large extents. Hackers often execute this attack in context of disclosing or retrieving sensitive information from the stack for instance pass keys, cryptographic privates keys etc.

Now the curiosity around here is how exactly the hackers perform this deadly attack. Consider a program where we are trying to produce some string as “kmaraj” over the screen by employing the simple C language library Printf method as;
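The excerpt breaks off before the article’s C listing, so I won’t guess at that code, but the underlying mistake, letting untrusted input be interpreted as a format string, is not unique to C. As a small analogue of my own (not the article’s example), Python’s str.format shows the information-disclosure side of the same class of bug:

```python
# Python analogue of a format string bug (my illustration, not the
# article's C example). A helper that passes caller-supplied text
# straight to str.format() lets an attacker walk object attributes.
class ServerConfig:
    def __init__(self):
        self.api_key = "s3cr3t-api-key"   # hypothetical secret

def render_greeting(template: str, user: str, config: ServerConfig) -> str:
    # BUG: `template` may come from the user yet is trusted as a format string.
    return template.format(user=user, config=config)

cfg = ServerConfig()
# Intended use:
print(render_greeting("Hello, {user}!", "alice", cfg))
# Attacker-supplied "template" leaks internals instead:
print(render_greeting("{config.api_key}", "alice", cfg))
```

The fix is the same in spirit as in C: never let user input occupy the format-string position.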

A bit deeper than most of my posts on bugs, but the lesson isn’t just the bug itself: it has persisted for going on fifteen (15) years now.

As a matter of fact, Karl Chen and David Wagner in Large-Scale Analysis of Format String Vulnerabilities in Debian Linux (2007) found:


We successfully analyze 66% of C/C++ source packages in the Debian 3.1 Linux distribution. Our system finds 1,533 format string taint warnings. We estimate that 85% of these are true positives, i.e., real bugs; ignoring duplicates from libraries, about 75% are real bugs.

We suggest that the technology exists to render format string vulnerabilities extinct in the near future. (emphasis added)

“…[N]ear future?” Maybe not because Mathias Payer and Thomas R. Gross report in 2013, String Oriented Programming: When ASLR is not Enough:

One different class of bugs has not yet received adequate attention in the context of DEP, stack canaries, and ASLR: format string vulnerabilities. If an attacker controls the first parameter to a function of the printf family, the string is parsed as a format string. Using such a bug and special format markers result in arbitrary memory writes. Existing exploits use format string vulnerabilities to mount stack or heap-based code injection attacks or to set up return oriented programming. Format string vulnerabilities are not a vulnerability of the past but still pose a significant threat (e.g., CVE-2012-0809 reports a format string bug in sudo and allows local privilege escalation; CVE-2012-1152 reports multiple format string bugs in perl-YAML and allows remote exploitation, CVE-2012-2369 reports a format string bug in pidgin-otr and allows remote exploitation) and usually result in full code execution for the attacker.

Should I assume in computer literature six (6) years doesn’t qualify as the “…near future?”

Would liability for string format bugs result in greater effort to avoid the same?

Hard to say in the abstract but the results could hardly be worse than fifteen (15) years of format string bugs.

Not to mention that liability would put the burden of avoiding the bug squarely on the shoulders of those best able to avoid it.

Math for Journalists Made Easy:…

Filed under: Journalism,News,Reporting,Statistics — Patrick Durusau @ 4:33 pm

Math for Journalists Made Easy: Understanding and Using Numbers and Statistics – Sign up now for new MOOC

From the post:

Journalists who squirm at the thought of data calculation, analysis and statistics can arm themselves with new reporting tools during the new Massive Open Online Course (MOOC) from the Knight Center for Journalism in the Americas: “Math for Journalists Made Easy: Understanding and Using Numbers and Statistics” will be taught from June 1 to 28, 2015.

Click here to sign up and to learn more about this free online course.

“Math is crucial to things we do every day. From covering budgets to covering crime, we need to understand numbers and statistics,” said course instructor Jennifer LaFleur, senior editor for data journalism for the Center for Investigative Reporting, one of the instructors of the MOOC.

Two other instructors will be teaching this MOOC: Brant Houston, a veteran investigative journalist who is a professor and the Knight Chair in Journalism at the University of Illinois; and freelance journalists Greg Ferenstein, who specializes in the use of numbers and statistics in news stories.

The three instructors will teach journalists “how to be critical about numbers, statistics and research and to avoid being improperly swayed by biased researchers.” The course will also prepare journalists to relay numbers and statistics in ways that are easy for the average reader to understand.

“It is true that many of us became journalists because sometime in our lives we wanted to escape from mathematics, but it is also true that it has never been so important for journalists to overcome any fear or intimidation to learn about numbers and statistics,” said professor Rosental Alves, founder and director of the Knight Center. “There is no way to escape from math anymore, as we are nowadays surrounded by data and we need at least some basic knowledge and tools to understand the numbers.”

The MOOC will be taught over a period of four weeks, from June 1 to 28. Each week focuses on a particular topic taught by a different instructor. The lessons feature video lectures and are accompanied by readings, quizzes and discussion forums.

This looks excellent.

I will be looking forward to very tough questions of government and corporate statistical reports from anyone who takes this course.

A Call for Collaboration: Data Mining in Cross-Border Investigations

Filed under: Journalism,Knowledge Management,News,Reporting — Patrick Durusau @ 4:12 pm

A Call for Collaboration: Data Mining in Cross-Border Investigations by Jonathan Stray and Drew Sullivan.

From the post:

Over the past few years we have seen the huge potential of data and document mining in investigative journalism. Tech savvy networks of journalists such as the Organized Crime and Corruption Reporting Project (OCCRP) and the International Consortium of Investigative Journalists (ICIJ) have teamed together for astounding cross-border investigations, such as OCCRP’s work on money laundering or ICIJ’s offshore leak projects. OCCRP has even incubated its own tools, such as VIS, Investigative Dashboard and Overview.

But we need to do better. There is enormous duplication and missed opportunity in investigative journalism software. Many small grants for technology development have led to many new tools, but very few have become widely used. For example, there are now over 70 tools just for social network analysis. There are other tools for other types of analysis, document handling, data cleaning, and on and on. Most of these are open source, and in various states of completeness, usability, and adoption. Developer teams lack critical capacities such as usability testing, agile processes, and business development for sustainability. Many of these tools are beautiful solutions in search of a problem.

The fragmentation of software development for investigative journalism has consequences: Most newsrooms still lack capacity for very basic knowledge management tasks, such as digitally filing new documents where they can be searched and found later. Tools do not work or do not inter-operate. Ultimately the reporting work is slower, or more expensive, or doesn’t get done. Meanwhile, the commercial software world has so far ignored investigative journalism because it is a small, specialized user-base. Tools like Nuix and Palantir are expensive, not networked, and not extensible for the inevitable story-specific needs.

But investigative journalists have learned how to work in cross-border networks, and investigative journalism developers can too. The experience gained from collaborative data-driven journalism has led OCCRP and other interested organizations to focus on the following issues:

The issues:

  • Usability
  • Delivery
  • Networked Investigation
  • Sustainability
  • Interoperability and extensibility

The next step is reported to be:

The next step for us is a small meeting: the very first conference on Knowledge Management in Investigative Journalism. This event will bring together key developers and journalists to refine the problem definition and plan a way forward. OCCRP and the Influence Mappers project have already pledged support. Stay tuned…

Jonathan Stray (jonathanstray@gmail.com) and Drew Sullivan (drew@occrp.org) want to know if you are interested too.

See the original post, email Jonathan and Drew if you are interested. It sounds like a very good idea to me.

PS: You already know one of the technologies that I think is important for knowledge management: topic maps!

H2O 3.0

Filed under: H20,Machine Learning — Patrick Durusau @ 3:03 pm

H2O 3.0

From the webpage:

Why H2O?

H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine.

Get H2O!

What is H2O?

H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

  • Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache™ Hadoop® and Spark™ to give customers the flexibility to solve their most challenging data problems.
  • Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
  • Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
  • Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
  • Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Note the caveat near the bottom of the page:

With H2O, you can:

  • Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
  • Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

The operative word being “can.” Your results with H2O depend upon your knowledge of machine learning, knowledge of your data and the effort you put into using H2O, among other things.
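To see what “minimal effort” looks like in practice, here is a short sketch using the H2O 3.x Python package. The file path and column names are placeholders, and it assumes the h2o package is installed and can start a local cluster.

```python
# Minimal H2O 3.x sketch (Python API). Paths and column names are placeholders.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()                                    # start or attach a local H2O cluster

frame = h2o.import_file("training_data.csv")  # hypothetical CSV
train, valid = frame.split_frame(ratios=[0.8], seed=42)

predictors = [c for c in frame.columns if c != "label"]   # "label" is assumed
model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5, seed=42)
model.train(x=predictors, y="label", training_frame=train, validation_frame=valid)

print(model.model_performance(valid))
h2o.cluster().shutdown()
```

Even with a sketch this small, the caveat above still holds: the quality of the model depends on you, not the platform.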

Bin Laden’s Bookshelf

Filed under: Intelligence,Security — Patrick Durusau @ 2:20 pm

Bin Laden’s Bookshelf

From the webpage:

On May 20, 2015, the ODNI released a sizeable tranche of documents recovered during the raid on the compound used to hide Usama bin Ladin. The release, which followed a rigorous interagency review, aligns with the President’s call for increased transparency–consistent with national security prerogatives–and the 2014 Intelligence Authorization Act, which required the ODNI to conduct a review of the documents for release.

The release contains two sections. The first is a list of non-classified, English-language material found in and around the compound. The second is a selection of now-declassified documents.

The Intelligence Community will be reviewing hundreds more documents in the near future for possible declassification and release. An interagency taskforce under the auspices of the White House and with the agreement of the DNI is reviewing all documents which supported disseminated intelligence cables, as well as other relevant material found around the compound. All documents whose publication will not hurt ongoing operations against al-Qa‘ida or their affiliates will be released.

The website also includes an image of Bin Laden’s bookcase.

The one expected work missing from Bin Laden’s library?

The Anarchist Cookbook!

Possession of the same books as Bin Laden will be taken as a sign of terrorist sympathies. Weed your collection responsibly.

Political Futures Tracker

Filed under: Government,Natural Language Processing,Politics,Sentiment Analysis — Patrick Durusau @ 2:01 pm

Political Futures Tracker.

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE); an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in different senses by the general population and political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case you can sell someone on the notion that what political candidates say today has some relevance to what they will do after election. President Obama has been in office for six (6) years; the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, and banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.

Analysis of named entity recognition and linking for tweets

Filed under: Entity Resolution,Natural Language Processing,Tweets — Patrick Durusau @ 1:29 pm

Analysis of named entity recognition and linking for tweets by Leon Derczynski, et al.

Abstract:

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

The questions addressed by the paper are:


RQ1 How robust are state-of-the-art named entity recognition and linking methods on short and noisy microblog texts?

RQ2 What problem areas are there in recognising named entities in microblog posts, and what are the major causes of false negatives and false positives?

RQ3 Which problems need to be solved in order to further the state-of-the-art in NER and NEL on this difficult text genre?

The ultimate conclusion is that entity recognition in microblog posts falls short of what has been achieved for newswire text but if you need results now or at least by tomorrow, this is a good guide to what is possible and where improvements can be made.
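For a feel of why newswire-trained tools struggle here, a quick off-the-shelf baseline is easy to run. The sketch below uses NLTK’s stock NER chunker (assuming the relevant NLTK data packages have been downloaded); it is exactly the kind of newswire-trained pipeline the paper shows degrading on short, noisy, lowercase tweet text.

```python
# Off-the-shelf NER baseline (NLTK), the kind of newswire-trained
# pipeline the paper finds degrading on tweets. Requires the NLTK data
# packages: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words.
import nltk

tweet = "lol @ManUtd got robbed at old trafford last nite smh"

tokens = nltk.word_tokenize(tweet)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Pull out anything the chunker thinks is a named entity.
entities = [
    (" ".join(tok for tok, _ in subtree.leaves()), subtree.label())
    for subtree in tree.subtrees()
    if subtree.label() in {"PERSON", "ORGANIZATION", "GPE", "LOCATION", "FACILITY"}
]
print(entities)   # likely misses "old trafford" and mangles "@ManUtd"
```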

Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

Filed under: Language,Natural Language Processing,Politics — Patrick Durusau @ 10:20 am

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.

Abstract:

Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty between choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

  • Falsification
  • Exaggeration
  • Omission
  • Misleading

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which aside from their names and names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.

A web interface for loading pointers to video, audio or text with political content that emits tagged deception with pointers to additional information would be a real hit for the next U.S. election cycle. Monetize with ads, the sources of additional information, etc.
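To make the four-way classification task concrete, here is a bare-bones discriminative baseline. This is not the paper’s model or feature set (they use psycholinguistic, structural, and data-driven cues), just TF-IDF plus logistic regression, and the statements and labels below are invented for illustration.

```python
# Bare-bones discriminative baseline for deception-strategy prediction.
# Not the paper's model or features; the statements and labels are
# invented purely to make the four-way classification task concrete.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

statements = [
    "I have never once spoken to that lobbyist",        # falsification
    "Our plan creates millions and millions of jobs",   # exaggeration
    "The report raised a few minor questions",          # omission
    "Experts everywhere agree my numbers add up",       # misleading
]
labels = ["falsification", "exaggeration", "omission", "misleading"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(statements, labels)

print(clf.predict(["I never met with them, not even once"]))
```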

I first saw this in a tweet by Leon Derczynski.

New Computer Bug Exposes Broad Security Flaws [Trust but Verify]

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:23 am

New Computer Bug Exposes Broad Security Flaws by Jennifer Valentino-Devries.

From the post:

A dilemma this spring for engineers at big tech companies, including Google Inc., Apple Inc. and Microsoft Corp., shows the difficulty of protecting Internet users from hackers.

Internet-security experts crafted a fix for a previously undisclosed bug in security tools used by all modern Web browsers. But deploying the fix could break the Internet for thousands of websites.

“It’s a twitchy business, and we try to be careful,” said Richard Barnes, who worked on the problem as the security lead for Mozilla Corp., maker of the Firefox Web browser. “The question is: How do you come up with a solution that gets as much security as you can without causing a lot of disruption to the Internet?”

Engineers at browser makers traded messages for two months, ultimately choosing a fix that could make more than 20,000 websites unreachable. All of the browser makers have released updates including the fix or will soon, company representatives said.

No links or pointers to further resources.

The name of this new bug is “Logjam.”

I saw Jennifer’s story on Monday evening, about 19:45 ESDT and tried to verify the story with some of the standard bug reporting services.

No “hits” at CERT, IBM’s X-Force, or the Internet Storm Center as of 20:46 ESDT on May 19, 2015.

The problem being that Jennifer did not include any links to any source that would verify the existence of this new bug. Not one.

The only story that kept popping up in searches was Jennifer’s.

So, I put this post to one side, returning to it this morning.

As of this morning, now about 6:55 ESDT, the Internet Storm Center returns:

Logjam – vulnerabilities in Diffie-Hellman key exchange affect browsers and servers using TLS by Brad Duncan, ISC Handler and Security Researcher at Rackspace, with a pointer to: The Logjam Attack, which reads in part:

We have published a technical report, Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice, which has specifics on these attacks, details on how we broke the most common 512-bit Diffie-Hellman Group, and measurements of who is affected. We have also published several proof of concept demos and a Guide to Deploying Diffie-Hellman for TLS

This study was performed by computer scientists at Inria Nancy-Grand Est, Inria Paris-Rocquencourt, Microsoft Research, Johns Hopkins University, University of Michigan, and the University of Pennsylvania: David Adrian, Karthikeyan Bhargavan, Zakir Durumeric, Pierrick Gaudry, Matthew Green, J. Alex Halderman, Nadia Heninger, Drew Springall, Emmanuel Thomé, Luke Valenta, Benjamin VanderSloot, Eric Wustrow, Santiago Zanella-Beguelin, and Paul Zimmermann. The team can be contacted at weak-dh@umich.edu.

As of 7:06 ESDT on May 20, 2015, neither CERT nor IBM’s X-Force returns any “hits” on “Logjam.”

It is one thing to “trust” a report of a bug, but please verify before replicating a story based upon insider gossip. Provide links to third party materials, for example.

May 19, 2015

Fighting Cybercrime at IBM

Filed under: Cybersecurity,Security — Patrick Durusau @ 6:46 pm

15/05/15 – More than 1,000 Organizations Join IBM to Battle Cybercrime

From the post:

ARMONK, NY – 14 May 2015: IBM (NYSE: IBM) today announced that more than 1,000 organizations across 16 industries are participating in its X-Force Exchange threat intelligence network, just one month after its launch. IBM X-Force Exchange provides open access to historical and real-time data feeds of threat intelligence, including reports of live attacks from IBM’s global threat monitoring network, enabling enterprises to defend against cybercrime.

IBM’s new cloud-based cyberthreat network, powered by IBM Cloud, is designed to foster broader industry collaboration by sharing actionable data to defend against these very real threats to businesses and governments. The company provided free access last month, via the X-Force Exchange, to its 700 terabyte threat database – a volume equivalent to all data that flows across the internet in two days. This includes two decades of malicious cyberattack data from IBM, as well as anonymous threat data from the thousands of organizations for which IBM manages security operations. Participants have created more than 300 new collections of threat data in the last month alone.

“Cybercrime has become the equivalent of a pandemic — no company or country can battle it alone,” said Brendan Hannigan, General Manager, IBM Security. “We have to take a collective and collaborative approach across the public and private sectors to defend against cybercrime. Sharing and innovating around threat data is central to battling highly organized cybercriminals; the industry can no longer afford to keep this critical resource locked up in proprietary databases. With X-Force Exchange, IBM has opened access to our extensive threat data to advance collaboration and help public and private enterprises safeguard themselves.”

Think about the numbers for a moment, 1,000 organizations and 300 new collections of threat data in a month. Not bad by anyone’s yardstick.

As I titled my first post on the X-Force Exchange: Being Thankful IBM is IBM.

Civil War Navies Bookworm

Filed under: History,Humanities,Indexing,Ngram Viewer,Searching,Text Analytics — Patrick Durusau @ 6:39 pm

Civil War Navies Bookworm by Abby Mullen.

From the post:

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.

Before you ask:

The earlier post: Text Analysis on the Documents of the Barbary Wars

More details on Bookworm.

As with all ngram viewers, exercise caution in assuming a text string has uniform semantics across historical, ethnic, or cultural fault lines.

Mile High Club

Filed under: Security — Patrick Durusau @ 6:20 pm

Mile High Club by Chris Rouland.

From the post:

A very elite club was just created by Chris Roberts, if his allegations of commandeering an airplane are true. Modern day transportation relies heavily on remote access to the outside world…and consumer trust. These two things have been at odds recently, ever since the world read a tweet from Chris Roberts, in which he jokingly suggested releasing oxygen masks while aboard a commercial flight. Whether or not Roberts was actually joking about hacking the aircraft is up for debate, but the move led the Government Accountability Office to issue a warning about potential vulnerabilities to aircraft systems via in-flight Wi-Fi.

Chris has a great suggestion:


While I agree that we don’t want every 16-year-old script kiddie trying to tamper with people’s lives at 35,000 feet, we do wonder if United or any of the other major carriers would be willing to park a plane at Black Hat. Surely if they were certain that there is no way to exploit the pilot’s aviation systems, they would be willing to allow expert researchers to have a look while the plane is on the ground? Tremendous insight and overall global information security could only improve if a major carrier or manufacturer hosted a hack week on a Dreamliner on the tarmac at McCarran international.

A couple of candidate Black Hat conferences:

BLACK HAT | USA August 1-6, 2015 | Mandalay Bay | Las Vegas, NV

BLACK HAT | EUROPE November 10-13, 2015 | Amsterdam RAI | The Netherlands

Do you think the conference organizers would comp registration for the people who come with the plane?

As far as airlines go, the top ten (10) in income (US) for 2014:

  1. Delta
  2. United
  3. American
  4. Southwest
  5. US Airways
  6. JetBlue
  7. Alaska
  8. Hawaiian
  9. Spirit
  10. SkyWest

When you register for a major Black Hat conference, ask the organizers to stage an airline hacking event, especially on a plane from one of the airlines listed above.

Picture big Black Hat Conference signs with the make/model on them, out on the tarmac.

It would make an interesting press event to have a photo of the conference sign with no plane.

Sorta speaks for itself. Yes?

The Back-to-Basics Readings of 2012

Filed under: Computer Science,Distributed Systems — Patrick Durusau @ 4:46 pm

The Back-to-Basics Readings of 2012 by Werner Vogels (CTO – Amazon.com).

From the post:

After the AWS re: Invent conference I spent two weeks in Europe for the last customer visits of the year. I have since returned and am now in New York City enjoying a few days of winding down the last activities of the year before spending the holidays here with family. Do not expect too many blog posts or twitter updates. Although there are still a few very exciting AWS news updates to happen this year.

I thought this was a good moment to collect all the readings I suggested this year in one summary post. It was not until later in the year that I started to recording the readings here on the blog, so I hope this is indeed the complete list. I am pretty sure some if not all of these papers deserved to be elected to the hall of fame of best papers in distributed systems.

My count is twenty-four (24) papers. More than enough for a weekend at the beach! 😉

I first saw this in a tweet by Computer Science.

NY police commissioner wants 450 more cops to hunt Unicorns

Filed under: Government,Security — Patrick Durusau @ 3:57 pm

NY police commissioner wants 450 more cops to fight against jihadis by Robert Spencer.

OK, not exactly what Police Commissioner Bratton said but it may as well be.

Bratton is quoted in the post as saying:

As the fear over the threat of terrorism continues to swell around the world, New York City becomes increasingly on edge that it’s time to take extra security precautions.

Although we have not experienced the caliber such as the attacks in Paris, we have a history of being a major target and ISIS has already begun to infiltrate young minds through the use of video games and social media.

Since the New Year there have been numerous arrests in Brooklyn and Queens for people attempting to assist ISIS from afar, building homemade bombs and laying out plans of attack.

This is called no-evidence police policy.

Particularly when you examine the “facts” behind:

…numerous arrests in Brooklyn and Queens for people attempting to assist ISIS from afar, building homemade bombs and laying out plans of attack.

Queens? Oh, yes, the two women from Queens whom an FBI informant befriended by finding a copy of the Anarchist Cookbook online and printing it out for them. Not to mention taking one of them shopping for bomb components. The alleged terrorists were going to educate themselves on bomb making. More a threat to themselves than anyone else. See my earlier post, e451 and Senator Dianne Feinstein; its focus is on the Anarchist Cookbook, but you can get the drift of how silly the complaint was in fact.

As far as Brooklyn, you can read the complaint for yourself, but the gist of it was that one of the three defendants could not travel because his mother would not give him his passport. Serious terrorist people we are dealing with here. The other two were terrorist wannabes who were long on boasting skills, but there isn’t a shortage of boasters in ISIS. Had they been able to connect with ISIS by some happenstance, it would have degraded the operational capabilities of ISIS, not assisted it.

A recent estimate puts the Muslim population of New York at 600,000, with 175 mosques in New York City and counting. See Muslims in New York City, Part II [2010].

The FBI was able to assist and prompt five (5) people out of an estimated 600,000 into making boasts about assisting ISIS and/or traveling to join ISIS.

I won’t even bother doing the math. Anyone who is looking for facts will know that five (5) arrests in a city of 12 million people don’t qualify as “numerous.” Especially when those arrests were for thought crimes and amateurish boasting rather than any attempt at an actual crime.

Support more cops to hunt Unicorns. Unicorn hunting doesn’t require military tactics or weapons, thereby making the civilian population safer.

Fast parallel computing with Intel Phi coprocessors

Filed under: Parallel Programming,R — Patrick Durusau @ 1:41 pm

Fast parallel computing with Intel Phi coprocessors by Andrew Ekstrom.

Andrew tells the tale of taking a 10,000×10,000 matrix raised to the power 10^17 from more than a week of processing down to 6-8 hours, and then to substantially shorter times still. Sigh, using Windows, but still an impressive feat! As you might expect, the speedups come from Revolution Analytics RRO, Intel’s Math Kernel Library (MKL), Intel Phi coprocessors, etc.

There’s enough detail (I suspect) for you to duplicate this feat on your own Windows box, or perhaps more easily on Linux.
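To see why such a computation is feasible at all, here is a minimal Python sketch (mine, not Andrew’s code): exponentiation by squaring needs only about 57 squarings for an exponent of 10^17, and NumPy hands each multiplication to an optimized BLAS such as Intel’s MKL.

import numpy as np

def matrix_power(a, n):
    # Exponentiation by squaring: O(log n) multiplications instead of n - 1.
    result = np.eye(a.shape[0], dtype=a.dtype)
    while n > 0:
        if n & 1:
            result = result @ a
        a = a @ a
        n >>= 1
    return result

a = np.random.rand(500, 500)          # small demo; Andrew's case was 10,000 x 10,000
a /= a.sum(axis=1, keepdims=True)     # row-stochastic, so the powers stay bounded
p = matrix_power(a, 10**17)

np.linalg.matrix_power does the same decomposition for you; the gains Andrew reports come from how fast each individual multiplication runs on MKL and the Phi coprocessors.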

Enjoy!

I first saw this in a tweet by David Smith.

The Applications of Probability to Cryptography

Filed under: Cryptography,Mathematics — Patrick Durusau @ 1:25 pm

The Applications of Probability to Cryptography by Alan M. Turing.

From the copyright page:

The underlying manuscript is held by the National Archives in the UK and can be accessed at www.nationalarchives.gov.uk using reference number HW 25/37. Readers are encouraged to obtain a copy.

The original work was under Crown copyright, which has now expired, and the work is now in the public domain.

You can go directly to the record page: http://discovery.nationalarchives.gov.uk/details/r/C11510465.

To get a useful image, you need to add the item to your basket for £3.30.

The manuscript is a mixture of typed text with inserted mathematical expressions added by hand (along with other notes and corrections). This is a typeset version that attempts to capture the original manuscript.

Another recently declassified Turing paper (typeset): The Statistics of Repetition.

Important reads. Turing would appreciate the need to exclude government from our day-to-day lives.
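If you want a feel for the core idea before tackling the manuscript, here is a tiny sketch (my illustration with plausible but made-up numbers, not Turing’s worked examples): competing hypotheses are scored by summing log Bayes factors, which Turing measured in bans and decibans.

import math

def decibans(p_given_h, p_given_not_h):
    # Weight of evidence one observation contributes, in decibans
    # (10 * log10 of the Bayes factor).
    return 10 * math.log10(p_given_h / p_given_not_h)

# Letter coincidences between two ciphertext streams occur with probability
# ~0.066 if they carry English plaintext "in depth", versus 1/26 if unrelated.
score = sum(decibans(0.066, 1 / 26) for _ in range(20))   # 20 coincidences observed
print(f"{score:.1f} decibans in favour of the 'in depth' hypothesis")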

Apache Cassandra 2.2.0-beta1 released

Filed under: Cassandra — Patrick Durusau @ 12:42 pm

Apache Cassandra 2.2.0-beta1 released

From the post:

The Cassandra team is pleased to announce the release of Apache Cassandra version 2.2.0-beta1.

This release is *not* production ready. We are looking for testing of existing and new features. If you encounter any problem please let us know [1].

Cassandra 2.2 features major enhancements such as:

* Resume-able Bootstrapping
* JSON Support [4]
* User Defined Functions [5]
* Server-side Aggregation [6]
* Role based access control

Read [2] and [3] to learn about all the new features.

Downloads of source and binary distributions are listed in our download section:

http://cassandra.apache.org/download/

Enjoy!

-The Cassandra Team

[1]: https://issues.apache.org/jira/browse/CASSANDRA
[2]: http://goo.gl/MyOEib (NEWS.txt)
[3]: http://goo.gl/MBJd1S (CHANGES.txt)
[4]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#json
[5]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udfs
[6]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udas

I was wondering what I would be reading this week! 😉

Enjoy!
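If you want something hands-on to go with the reading, here is a minimal sketch of the new JSON support, assuming a local 2.2 node and the DataStax Python driver (cassandra-driver); it is my illustration, not part of the release notes.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text, tags set<text>)")

# New in 2.2: insert a whole row from a JSON document...
session.execute("""INSERT INTO demo.users JSON '{"id": 1, "name": "Ada", "tags": ["crypto", "math"]}'""")

# ...and read rows back as JSON.
for row in session.execute("SELECT JSON * FROM demo.users"):
    print(row[0])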

You Are Safer Than You Think

Filed under: Government,Security — Patrick Durusau @ 12:25 pm

Despite the budget-sustaining efforts of Thomas Fuentes, former associate director of the FBI (see Why Does the FBI Have To Manufacture Its Own Plots If Terrorism And ISIS Are Such Grave Threats? by Glenn Greenwald), you are really safer than you think.

Not because security systems are that great, but because of a lack of people who want to do you harm.

Think about airport security for a minute. There are the child-fondling TSA agents and the ones who like to paw through used underwear, but how much protection do you think those methods really provide? What about all those highly trained and well-paid baggage handlers and other folks at the airport?

The answer to that question can be found in: Airport Baggage Handlers Charged in Wide-Ranging Conspiracy to Transport Drugs Across the Country. Certainly would not want airport security to interfere with the flow of illegal drugs across the country.

From the post:

Fourteen persons have been charged in connection with an alleged wide-ranging criminal conspiracy to violate airport security requirements and transport drugs throughout the country, announced U.S. Attorney Melinda Haag of the Northern District of California, Special Agent in Charge José M. Martinez of the Internal Revenue Service-Criminal Investigation (IRS-CI) for the Northern District of California and Special Agent in Charge David J. Johnson of the Federal Bureau of Investigation (FBI). The case highlights the government’s determination to address security concerns in and around the nation’s airports.

In a criminal complaint partially unsealed today, the co-conspirators were described as a drug trafficking organization determined to use the special access some of them had been granted as baggage handlers at the Oakland International Airport to circumvent the security measures in place at the airport. As alleged in the complaint, the baggage handlers entered the Air Operations Area (AOA) of the Oakland Airport while in possession of baggage containing marijuana. The AOA is an area of the airport that is accessible to employees but not to passengers who have completed security screening through a Transportation Security Administration (TSA) checkpoint. The baggage handlers were not required to pass through a TSA security screening checkpoint to enter the AOA. The baggage handlers then used their security badges to open a secure door that separates the AOA from the sterile passenger terminal where outbound passengers, who have already passed through the TSA security and screening checkpoint, wait to board their flights. The baggage handlers then gave the baggage containing drugs to passengers who then transported the drugs in carry-on luggage on their outbound flights. After arriving in a destination city, the drugs were distributed and sold.

I take the line:

The case highlights the government’s determination to address security concerns in and around the nation’s airports.

to mean the government admits that drugs, weapons or bombs are easy to get on board aircraft in the United States.

That should not worry you because since 9/11, despite airline security being as solid as a bucket without a bottom, no one has blown a plane out of the sky and no one has taken over an airplane with a weapon.

That’s 14 years and more than 10.3 billion passengers later: not one bomb, not one takeover by weapon, despite laughable airport “security.”

The only conclusion I can draw from the data is a near-total lack of anyone in the United States who wishes to bomb an airplane or take it over with weapons.

I say “near-total lack” because there could be someone, but it is fewer than one person per 10.3 billion. Those are fairly good odds.

May 18, 2015

FreeSearch

Filed under: Computer Science,Search Behavior,Search Engines,Searching — Patrick Durusau @ 5:52 pm

FreeSearch

From the “about” page:

The FreeSearch project is a search system on top of DBLP data provided by Michael Ley. FreeSearch is a joint project of the L3S Research Center and iSearch IT Solutions GmbH.

In this project we develop new methods for simple literature search that works on any catalogs, without requiring in-depth knowledge of the metadata schema. The system helps users proactively and unobtrusively by guessing at each step what the user’s real information need is and providing precise suggestions.

A more detailed description of the system can be found in this publication: FreeSearch – Literature Search in a Natural Way.
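To illustrate the “proactive and unobtrusive” behaviour in a few lines (my sketch, not how FreeSearch is actually implemented), suggestions can be drawn from the catalog itself, so the user never has to know the metadata schema:

titles = [
    "Dynamo: Amazon's Highly Available Key-value Store",
    "MapReduce: Simplified Data Processing on Large Clusters",
    "The Google File System",
    "Dremel: Interactive Analysis of Web-Scale Datasets",
]

def suggest(query, catalog, limit=3):
    # Return titles containing every word typed so far.
    words = query.lower().split()
    return [t for t in catalog if all(w in t.lower() for w in words)][:limit]

print(suggest("data", titles))
# ['MapReduce: Simplified Data Processing on Large Clusters',
#  'Dremel: Interactive Analysis of Web-Scale Datasets']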

You can choose to search across:

DBLP (4,552,889 documents)

TIBKat (2,079,012 documents)

CiteSeer (1,910,493 documents)

BibSonomy (448,166 documents)

Enjoy!

Cell Stores

Filed under: Cell Stores,NoSQL — Patrick Durusau @ 5:31 pm

Cell Stores by Ghislain Fourny.

Abstract:

Cell stores provide a relational-like, tabular level of abstraction to business users while leveraging recent database technologies, such as key-value stores and document stores. This allows to scale up and out the efficient storage and retrieval of highly dimensional data. Cells are the primary citizens and exist in different forms, which can be explained with an analogy to the state of matter: as a gas for efficient storage, as a solid for efficient retrieval, and as a liquid for efficient interaction with the business users. Cell stores were abstracted from, and are compatible with the XBRL standard for importing and exporting data. The first cell store repository contains roughly 200GB of SEC filings data, and proves that retrieving data cubes can be performed in real time (the threshold acceptable by a human user being at most a few seconds).

Github: http://github.com/28msec/cellstore

Demonstration with 200 GB of SEC data.

Tutorial: An Introduction To The Cell Store REST API.

From the tutorial:

Cell stores are a new paradigm of databases. It is decoupled from XBRL and has a data model of its own, yet it natively supports XBRL as a file format to exchange data between cell stores.

Traditional relational databases are focused on tables. Document stores are focused on trees. Triple stores are focused on graphs. Well, cell stores are focused on cells. Cells are units of data and also called facts, measures, etc. Think of taking an Excel spreadsheet and a pair of scissors, and of splitting the sheet into its cells. Put these cells in a bag. Pour some more cells that come from other spreadsheets. Many. Millions of cells. Billions of cells. Trillions of cells. You have a cell store.

Why is it so important to store all these cells in a single, big bag? That’s because the main use case for cell stores is the ability to query data across filings. Cell stores are very good at this. They were designed from day one to do this.

Cell stores are very good at reconstructing tables in the presence of highly dimensional data. The idea behind this is based on hypercubes and is called NoLAP (NoSQL Online Analytical Processing). NoLAP extends the OLAP paradigm by removing hypercube rigidity and letting users generate their own hypercubes on the fly on the same pool of cells.

For business users, all of this is completely transparent and hidden. The look and feel of a cell store, in the end, is that of a spreadsheet like Excel. If you are familiar with the pivot table functionality of Excel, cell stores will be straightforward to understand. Also the underlying XBRL is hidden.

XBRL is to cell stores what the internal format of .xlsx files is to Excel. How many of us have tried to unzip and open an Excel file with a text editor for any other reason than mere curiosity? The same goes for cell stores.

Forget about the complexity of XBRL. Get things done with your data.

The promise of a better user interface alone should be enough to attract attention to this proposal. Yet, so far as I can find, there hasn’t been a lot of use/discussion of it.
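To make the “bag of cells” idea concrete, here is a toy Python sketch (my illustration, not the cell store API): every cell is a value plus its dimensions, a hypercube query is a filter over those dimensions, and the spreadsheet view is just a pivot of the result.

from collections import defaultdict

cells = [
    {"concept": "Equity",  "entity": "ACME",   "period": "2014", "value": 120},
    {"concept": "Equity",  "entity": "ACME",   "period": "2015", "value": 150},
    {"concept": "Revenue", "entity": "ACME",   "period": "2015", "value": 900},
    {"concept": "Equity",  "entity": "Globex", "period": "2015", "value": 75},
]

def slice_cells(cells, **dims):
    # Keep only the cells whose dimensions match every constraint.
    return [c for c in cells if all(c.get(d) == v for d, v in dims.items())]

table = defaultdict(dict)   # pivot the 2015 slice into an entity x concept table
for cell in slice_cells(cells, period="2015"):
    table[cell["entity"]][cell["concept"]] = cell["value"]
print(dict(table))   # {'ACME': {'Equity': 150, 'Revenue': 900}, 'Globex': {'Equity': 75}}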

I do wonder about this statement in the paper:

When many people define their own taxonomy, this often ends up in redundant terminology. For example, someone might use the term Equity and somebody else Capital. When either querying cells with a hypercube, or loading cells into a spreadsheet, a mapping can be applied so that this redundant terminology is transparent. This way, when a user asks for Equity, (i) she will also get the cells having the concept Capital, (ii) and it will be transparent to her because the Capital concept is overridden with the expected value Equity.

In part because it omits the obvious case of conflicting terminology, that is, we both want to use “German” as a term, but I mean the language and you mean the nationality. In one well-known graph database the answer depends on which of us gets there first. Poor form in my opinion.

Mapping can handle different terms for the same subject, but how do we maintain that mapping? Where do I look to discover the reason(s) underlying the mapping? Moreover, in the conflicting case, how do I distinguish otherwise opaque terms that are letter-for-letter identical?

There may be answers as I delve deeper into the documentation but those are some topic map issues that stood out for me on a first read.
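One direction an answer could take, sketched below purely as my own illustration (nothing in the paper or the cell store API): key the mapping on the term plus a scope rather than on the bare term, and carry provenance so the reason for each equivalence can be audited later.

mappings = {
    ("Capital", "finance"):     {"canonical": "Equity", "reason": "same subject, different label",
                                 "source": "analyst taxonomy (hypothetical)"},
    ("German",  "language"):    {"canonical": "deu", "reason": "ISO 639-3 code", "source": "ISO 639-3"},
    ("German",  "nationality"): {"canonical": "DE",  "reason": "ISO 3166 code",  "source": "ISO 3166-1"},
}

def resolve(term, scope):
    entry = mappings.get((term, scope))
    return entry["canonical"] if entry else term

print(resolve("Capital", "finance"))     # Equity
print(resolve("German", "language"))     # deu -- same spelling as the next line, different subject
print(resolve("German", "nationality"))  # DE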

Comments?

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic

Filed under: ElasticSearch,MarkLogic,MongoDB — Patrick Durusau @ 2:21 pm

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic by William Candillon.

From the post:

Virtual Databases enable developers to write applications regardless of the underlying database technologies. We recently updated a database infrastructure from MongoDB and ElasticSearch to MarkLogic without touching the codebase.

We just flipped a switch. We updated the database infrastructure of an application (20k LOC) from MongoDB and Elasticsearch to MarkLogic without changing a single line of code.

Earlier this year, we published a tutorial that shows how the 28msec query technology can enable developers to write applications regardless of the underlying database technology. Recently, we had the opportunity to put it to the test on both a real world use case and a substantial codebase.

At 28msec, we have designed [1] and implemented [2] an open source modern data warehouse called CellStore. Whereas traditional data warehousing solutions can only support hundreds of fixed dimensions and thus need to ETL the data to analyze, cell stores support an unbounded number of dimensions. Our implementation of the cell store paradigm is around 20k lines of JSONiq queries. Originally the implementation was running on top of MongoDB and Elasticsearch.
….

1. http://arxiv.org/pdf/1410.0600.pdf
2. http://github.com/28msec/cellstore

Impressive work and it merits a separate post on the underlying technology, CellStore.
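The underlying pattern, stripped of the JSONiq machinery, looks roughly like the sketch below (my illustration, not 28msec’s code): the application talks to one abstract store, each database sits behind an adapter with the same interface, and swapping MongoDB and Elasticsearch for MarkLogic becomes a configuration change rather than a code change.

from abc import ABC, abstractmethod

class DocumentStore(ABC):
    @abstractmethod
    def put(self, collection, doc): ...
    @abstractmethod
    def find(self, collection, **criteria): ...

class InMemoryStore(DocumentStore):
    # Stand-in backend; MongoDB, Elasticsearch or MarkLogic adapters would
    # expose the same two calls.
    def __init__(self):
        self._data = {}
    def put(self, collection, doc):
        self._data.setdefault(collection, []).append(doc)
    def find(self, collection, **criteria):
        return [d for d in self._data.get(collection, [])
                if all(d.get(k) == v for k, v in criteria.items())]

def report(store):
    # Application code: it never mentions which database is underneath.
    return store.find("filings", entity="ACME")

store = InMemoryStore()          # swap in another adapter here; report() is unchanged
store.put("filings", {"entity": "ACME", "period": "2015"})
print(report(store))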
