‘Learning to Rank’ (No Unique Feature Name Fail – Update)

January 24th, 2018

Elasticsearch ‘Learning to Rank’ Released, Bringing Open Source AI to Search Teams

From the post:

Search experts at OpenSource Connections, the Wikimedia Foundation, and Snagajob, deliver open source cognitive search capabilities to the Elasticsearch community. The open source Learning to Rank plugin allows organizations to control search relevance ranking with machine learning. The plugin is currently delivering search results at Wikipedia and Snagajob, providing significant search quality improvements over legacy solutions.

Learning to Rank lets organizations:

  • Directly optimize sales, conversions and user satisfaction in search
  • Personalize search for users
  • Drive deeper insights from a knowledge base
  • Customize ranking down for complex nuance
  • Avoid the sticker shock & lock-in of a proprietary "cognitive search" product

“Our mission is to empower search teams. This plugin gives teams deep control of ranking, allowing machine learning models to be directly deployed to the search engine for relevance ranking” said Doug Turnbull, author of Relevant Search and CTO, OpenSource Connections.

I need to work through all the documentation and examples but:

Feature Names are Unique

Because some model training libraries refer to features by name, Elasticsearch LTR enforces unique names for each features. In the example above, we could not add a new user_rating feature without creating an error.

is a warning of what you (and I) are likely to find.

Really? Someone involved in the design thought globally unique feature names were a good idea? Or, at a minimum, didn’t realize they are a very bad idea?

Scope anyone? Either in the programming or topic map sense?

Despite the unique feature name fail, I’m sure ‘Learning to Rank’ will be useful. But not as useful as it could have been.

Doug Turnbull (https://twitter.com/softwaredoug) advises that features are scoped by feature stores, so the correct prose would read: “…LTR enforces unique names for each feature within a feature store.”

No fail, just bad writing.
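
For the curious, here is a rough sketch of what that scoping looks like in practice. It assumes the plugin’s REST endpoints as I read its documentation (PUT _ltr/{store} to create a store, POST _ltr/{store}/_featureset/{name} to add features); the store names, field and feature set names are invented for illustration, and the endpoint shapes may differ between plugin versions:

    # Sketch only: verify the LTR plugin's endpoints against the docs for your version.
    import requests

    ES = "http://localhost:9200"

    feature_set = {
        "featureset": {
            "features": [{
                "name": "user_rating",   # only has to be unique within its feature store
                "params": [],
                "template_language": "mustache",
                "template": {
                    "function_score": {
                        "functions": [{"field_value_factor": {"field": "vote_average", "missing": 0}}],
                        "query": {"match_all": {}},
                    }
                },
            }]
        }
    }

    # Two independent stores, each free to define its own "user_rating" feature.
    for store in ("movies_store", "jobs_store"):
        print(requests.put(f"{ES}/_ltr/{store}").status_code)
        print(requests.post(f"{ES}/_ltr/{store}/_featureset/baseline", json=feature_set).status_code)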

Eset’s Guide to DeObfuscating and DeVirtualizing FinFisher

January 24th, 2018

Eset’s Guide to DeObfuscating and DeVirtualizing FinFisher

From the introduction:

Thanks to its strong anti-analysis measures, the FinFisher spyware has gone largely unexplored. Despite being a prominent surveillance tool, only partial analyses have been published on its more recent samples.

Things were put in motion in the summer of 2017 with ESET’s analysis of FinFisher surveillance campaigns that ESET had discovered in several countries. In the course of our research, we have identified campaigns where internet service providers most probably played the key role in compromising the victims with FinFisher.

When we started thoroughly analyzing this malware, the main part of our effort was overcoming FinFisher’s anti-analysis measures in its Windows versions. The combination of advanced obfuscation techniques and proprietary virtualization makes FinFisher very hard to de-cloak.

To share what we learnt in de-cloaking this malware, we have created this guide to help others take a peek inside FinFisher and analyze it. Apart from offering practical insight into analyzing FinFisher’s virtual machine, the guide can also help readers to understand virtual machine protection in general – that is, proprietary virtual machines found inside a binary and used for software protection. We will not be discussing virtual machines used in interpreted programming languages to provide compatibility across various platforms, such as the Java VM.

We have also analyzed Android versions of FinFisher, whose protection mechanism is based on an open source LLVM obfuscator. It is not as sophisticated or interesting as the protection mechanism used in the Windows versions, thus we will not be discussing it in this guide.

Hopefully, experts from security researchers to malware analysts will make use of this guide to better understand FinFisher’s tools and tactics, and to protect their customers against this omnipotent security and privacy threat.

Beyond me at the moment but one should always try to learn from the very best. Making note of what can’t be understood/used today in hopes of revisiting it in the future.
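
To get a rough feel for what “proprietary virtualization” means here, consider a toy interpreter. This is not FinFisher’s VM, just the general shape ESET describes: the real logic is compiled into a custom bytecode that only an embedded interpreter understands, so an analyst has to reverse the interpreter before any of the program’s behavior makes sense.

    # Toy illustration of VM-based protection: custom bytecode plus a tiny interpreter.
    # FinFisher does this in heavily obfuscated native code, not Python.
    BYTECODE = [
        ("PUSH", 6),
        ("PUSH", 7),
        ("MUL", None),
        ("RET", None),
    ]

    def run_vm(program):
        stack = []
        for op, arg in program:
            if op == "PUSH":
                stack.append(arg)
            elif op == "MUL":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "RET":
                return stack.pop()

    print(run_vm(BYTECODE))  # 42 -- but a disassembler only ever sees the interpreter loop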

Numerous reports describe FinFisher as spyware sold exclusively to governments and their agencies. Perhaps less “exclusively” than previously thought.

In any event, FinFisher is reported to be in the wild, so perhaps governments that bought FinFisher will be uncovered by FinFisher.

A more deserving group of people is hard to imagine.

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text

January 24th, 2018

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text by Nicholas Carlini and David Wagner.

Abstract:

We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). We apply our iterative optimization-based attack to Mozilla’s implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduces a new domain to study adversarial examples.

You can consult the data used and code at: http://nicholas.carlini.com/code/audio_adversarial_examples.
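
Their attack is beyond a blog-sized example, but the basic loop, iteratively nudging the waveform to lower a loss on the target transcription while keeping the perturbation small, can be sketched against a toy model. Everything below is a stand-in: the “model” is a random linear map rather than DeepSpeech, and the loss is plain cross-entropy rather than CTC.

    import numpy as np

    np.random.seed(0)
    W = np.random.randn(4, 100)            # toy "speech model": audio -> class scores

    def loss_and_grad(audio, target):
        s = W @ audio
        p = np.exp(s - np.logaddexp.reduce(s))   # softmax
        loss = np.logaddexp.reduce(s) - s[target]
        onehot = np.zeros_like(p)
        onehot[target] = 1.0
        return loss, W.T @ (p - onehot)

    original = np.random.randn(100)        # the benign "waveform"
    delta = np.zeros_like(original)        # adversarial perturbation
    eps, lr, target = 0.5, 0.05, 2         # perturbation budget, step size, desired "phrase"

    for _ in range(500):                   # projected gradient descent
        _, g = loss_and_grad(original + delta, target)
        delta = np.clip(delta - lr * g, -eps, eps)

    before, _ = loss_and_grad(original, target)
    after, _ = loss_and_grad(original + delta, target)
    print(f"target loss before {before:.2f}, after {after:.2f}")  # loss drops; the budget bounds the change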

Important not only for defeating automatic speech recognition but also for establishing that the properties of adversarial examples in audio differ from those in visual recognition.

A hint that automatic recognition properties cannot be assumed for unexplored domains.

Visualizing trigrams with the Tidyverse (Who Reads Jane Austen?)

January 24th, 2018

Visualizing trigrams with the Tidyverse by Emil Hvitfeldt.

From the post:

In this post I’ll go through how I created the data visualization I posted yesterday on twitter:

Great post and R code, but who reads Jane Austen? 😉

I have a serious weakness for academic and ancient texts so the Jane Austen question is meant in jest.

The more direct question is to what other texts would you apply this trigram/visualization technique?

Suggestions?

I have some texts in mind but defer mentioning them while I prepare a demonstration of Hvitfeldt’s technique applied to them.
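
Hvitfeldt works in R with the tidyverse; for anyone who wants the counting step without leaving Python, a minimal equivalent follows (the file name is a placeholder, and the plotting that makes his post worth reading is left to your tool of choice):

    import re
    from collections import Counter

    # Any plain-text source works; "austen.txt" is a placeholder for your corpus.
    words = re.findall(r"[a-z']+", open("austen.txt", encoding="utf-8").read().lower())

    trigrams = Counter(zip(words, words[1:], words[2:]))
    for trigram, count in trigrams.most_common(10):
        print(" ".join(trigram), count)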

PS: I ran across an odd comment in the janeaustenr package:

Each text is in a character vector with elements of about 70 characters.

You have to hunt for a bit, but 70 characters is the default plain text line length for Project Gutenberg files. Some poor decisions are going to be with us for a very long time.

Data Science at the Command Line (update, now online for free)

January 24th, 2018

Data Science at the Command Line by Jeroen Janssens.

From the webpage:

This is the website for Data Science at the Command Line, published by O’Reilly October 2014 First Edition. This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.

To get you started—whether you’re on Windows, macOS, or Linux—author Jeroen Janssens has developed a Docker image packed with over 80 command-line tools.

Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

I posted about Data Science at the Command Line in August of 2014 and it remains as relevant today as when originally published.

Impress your friends, perhaps your manager, but most importantly, yourself.

Enjoy!

Games = Geeks, Geeks = People with Access (New Paths To Transparency)

January 24th, 2018

Critical Flaw in All Blizzard Games Could Let Hackers Hijack Millions of PCs by Mohit Kumar.

From the post:

A Google security researcher has discovered a severe vulnerability in Blizzard games that could allow remote attackers to run malicious code on gamers’ computers.

Played every month by half a billion users—World of Warcraft, Overwatch, Diablo III, Hearthstone and Starcraft II are popular online games created by Blizzard Entertainment.

To play Blizzard games online using web browsers, users need to install a game client application, called ‘Blizzard Update Agent,’ onto their systems that run JSON-RPC server over HTTP protocol on port 1120, and “accepts commands to install, uninstall, change settings, update and other maintenance related options.”
… (emphasis in original)

See Kumar’s post for the details on “DNS Rebinding.”

Unless you are running a botnet, why would anyone want to hijack millions of PCs?

If you wanted to rob for cash, would you rob people buying subway tokens or would you rob a bank? (That’s not a trick question. Bank is the correct answer.)

The same is true with creating government or corporate transparency. You could subvert every computer at a location but the smart money says to breach the server and collect all the documents from that central location.

How to breach servers? Target sysadmins, i.e., the people who play computer games.

PS: I would not be overly concerned with Blizzard’s reported development of patches. No doubt other holes exist or will be created by their patches.

The vector algebra war: a historical perspective [Semantic Confusion in Engineering and Physics]

January 23rd, 2018

The vector algebra war: a historical perspective by James M. Chappell, Azhar Iqbal, John G. Hartnett, Derek Abbott.

Abstract:

There are a wide variety of different vector formalisms currently utilized in engineering and physics. For example, Gibbs’ three-vectors, Minkowski four-vectors, complex spinors in quantum mechanics, quaternions used to describe rigid body rotations and vectors defined in Clifford geometric algebra. With such a range of vector formalisms in use, it thus appears that there is as yet no general agreement on a vector formalism suitable for science as a whole. This is surprising, in that, one of the primary goals of nineteenth century science was to suitably describe vectors in three-dimensional space. This situation has also had the unfortunate consequence of fragmenting knowledge across many disciplines, and requiring a significant amount of time and effort in learning the various formalisms. We thus historically review the development of our various vector systems and conclude that Clifford’s multivectors best fulfills the goal of describing vectorial quantities in three dimensions and providing a unified vector system for science.

An image from the paper captures the “descent of the various vector systems:”

The authors contend for use of Clifford’s multivectors over the other vector formalisms described.
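
For readers who haven’t met geometric algebra, the unification rests on the geometric product of two vectors, which carries the symmetric (inner) and antisymmetric (outer) parts at once. In the standard notation:

    ab = a \cdot b + a \wedge b , \qquad a \times b = -I \,(a \wedge b), \quad I = e_1 e_2 e_3

The first term is the familiar dot product; the second, the wedge (outer) product, encodes the oriented plane that Gibbs’ cross product points out of, and the complex numbers and quaternions reappear as even-grade subalgebras. That is the sense in which the authors can offer multivectors as a unified system rather than yet another competitor.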

Assuming Clifford’s multivectors displace all other systems in use, the authors fail to say how readers will access the present and past legacy of materials written in the other formalisms.

If the goal is to eliminate “fragmenting knowledge across many disciplines, and requiring a significant amount of time and effort in learning the various formalisms,” that goal fails in the absence of a mechanism for accessing existing materials from within Clifford’s multivector formalism.

Topic maps anyone?

Stop, Stop, Stop All the Patching, Give Intel Time to Breathe

January 23rd, 2018

Root Cause of Reboot Issue Identified; Updated Guidance for Customers and Partners by Navin Shenoy.

From the post:

As we start the week, I want to provide an update on the reboot issues we reported Jan. 11. We have now identified the root cause for Broadwell and Haswell platforms, and made good progress in developing a solution to address it. Over the weekend, we began rolling out an early version of the updated solution to industry partners for testing, and we will make a final release available once that testing has been completed.

Based on this, we are updating our guidance for customers and partners:

  • We recommend that OEMs, cloud service providers, system manufacturers, software vendors and end users stop deployment of current versions, as they may introduce higher than expected reboots and other unpredictable system behavior. For the full list of platforms, see the Intel.com Security Center site.
  • We ask that our industry partners focus efforts on testing early versions of the updated solution so we can accelerate its release. We expect to share more details on timing later this week.
  • We continue to urge all customers to vigilantly maintain security best practice and for consumers to keep systems up-to-date.

I apologize for any disruption this change in guidance may cause. The security of our products is critical for Intel, our customers and partners, and for me, personally. I assure you we are working around the clock to ensure we are addressing these issues.

I will keep you updated as we learn more and thank you for your patience.

Essence of Shenoy’s advice:

…OEMs, cloud service providers, system manufacturers, software vendors and end users stop deployment of current versions, as they may introduce higher than expected reboots and other unpredictable system behavior.

Or better:

Patching an Intel machine makes it worse.

That’s hardly news.

Unverifiable firmware/code + unverifiable patch = unverifiable firmware/code + patch. What part of that seems unclear?

WebGoat (Advantage over OPM)

January 22nd, 2018

Deliberately Insecure Web Application: OWASP WebGoat

From the webpage:

WebGoat is a deliberately insecure web application maintained by OWASP designed to teach web application security lessons. You can install and practice with WebGoat in either J2EE or WebGoat for .Net in ASP.NET. In each lesson, users must demonstrate their understanding of a security issue by exploiting a real vulnerability in the WebGoat applications.

WebGoat for J2EE is written in Java and therefore installs on any platform with a Java virtual machine. Once deployed, the user can go through the lessons and track their progress with the scorecard.

WebGoat’s scorecards are a feature you won’t find when hacking the Office of Personnel Management (OPM). OPM hacks are reported by its inspector general and, more generally, in the computer security press.

EFF Investigates Dark Caracal (But Why?)

January 22nd, 2018

Someone is touting a mobile, PC spyware platform called Dark Caracal to governments by Iain Thomson.

From the post:

An investigation by the Electronic Frontier Foundation and security biz Lookout has uncovered Dark Caracal, a surveillance-toolkit-for-hire that has been used to suck huge amounts of data from Android mobiles and Windows desktop PCs around the world.

Dark Caracal [PDF] appears to be controlled from the Lebanon General Directorate of General Security in Beirut – an intelligence agency – and has slurped hundreds of gigabytes of information from devices. It shares its backend infrastructure with another state-sponsored surveillance campaign, Operation Manul, which the EFF claims was operated by the Kazakhstan government last year.

Crucially, it appears someone is renting out the Dark Caracal spyware platform to nation-state snoops.

The EFF could be spending its time and resources duplicating Dark Caracal for the average citizen.

Instead, the EFF continues its quixotic pursuit of governmental wrong-doers. I say “quixotic” because those pilloried by the EFF, such as the NSA, never change their behavior. Unlawful conduct, including surveillance, continues.

But don’t take my word for it, the NSA admits that it deletes data it promised under court order to preserve: NSA deleted surveillance data it pledged to preserve. No consequences. Just like there were no consequences when Snowden revealed widespread and illegal surveillance by the NSA.

So you have to wonder, if investigating and suing governmental intelligence organizations produces no tangible results, why is the EFF pursuing them?

If the average citizen had the equivalent of Dark Caracal at their disposal, say as desktop software, the ability of governments like Lebanon, Kazakhstan, and others, to hide their crimes, would be greatly reduced.

Exposure is no guarantee of accountability and/or punishment, but the whack-a-mole strategy of the EFF hasn’t produced transparency or consequences.

Don Knuth Needs Your Help

January 22nd, 2018

Donald Knuth Turns 80, Seeks Problem-Solvers For TAOCP

From the post:

An anonymous reader writes:

When 24-year-old Donald Knuth began writing The Art of Computer Programming, he had no idea that he’d still be working on it 56 years later. This month he also celebrated his 80th birthday in Sweden with the world premier of Knuth’s Fantasia Apocalyptica, a multimedia work for pipe organ and video based on the bible’s Book of Revelations, which Knuth describes as “50 years in the making.”

But Knuth also points to the recent publication of “one of the most important sections of The Art of Computer Programming” in preliminary paperback form: Volume 4, Fascicle 6: Satisfiability. (“Given a Boolean function, can its variables be set to at least one pattern of 0s and 1s that will make the function true?”)

Here’s an excerpt from its back cover:

Revolutionary methods for solving such problems emerged at the beginning of the twenty-first century, and they’ve led to game-changing applications in industry. These so-called “SAT solvers” can now routinely find solutions to practical problems that involve millions of variables and were thought until very recently to be hopelessly difficult.

“in several noteworthy cases, nobody has yet pointed out any errors…” Knuth writes on his site, adding “I fear that the most probable hypothesis is that nobody has been sufficiently motivated to check these things out carefully as yet.” He’s uncomfortable printing a hardcover edition that hasn’t been fully vetted, and “I would like to enter here a plea for some readers to tell me explicitly, ‘Dear Don, I have read exercise N and its answer very carefully, and I believe that it is 100% correct,'” where N is one of the exercises listed on his web site.

Elsewhere he writes that two “pre-fascicles” — 5a and 5B — are also available for alpha-testing. “I’ve put them online primarily so that experts in the field can check the contents before I inflict them on a wider audience. But if you want to help debug them, please go right ahead.”
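
If the back-cover phrasing is too compressed, a brute-force satisfiability check makes the definition concrete. The sketch below simply tries every assignment, which is exactly the exponential approach that modern SAT solvers avoid on problems with millions of variables:

    from itertools import product

    # CNF formula: each clause lists literals; n means variable n, -n its negation.
    # This encodes (x1 or not x2) and (x2 or x3) and (not x1 or not x3).
    clauses = [[1, -2], [2, 3], [-1, -3]]

    def satisfiable(clauses, n_vars):
        for bits in product([False, True], repeat=n_vars):
            assignment = {i + 1: bits[i] for i in range(n_vars)}
            if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                   for clause in clauses):
                return assignment     # a pattern of 0s and 1s that makes the formula true
        return None

    print(satisfiable(clauses, 3))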

Do you have some other leisure project for 2018 that is more important?

😉

A “no one saw” It Coming Memory Hack (Schneider Electric)

January 21st, 2018

Schneider Electric: TRITON/TRISIS Attack Used 0-Day Flaw in its Safety Controller System, and a RAT by Kelly Jackson Higgins.

Industrial control systems giant Schneider Electric discovered a zero-day privilege-escalation vulnerability in its Triconex Tricon safety-controller firmware which helped allow sophisticated hackers to wrest control of the emergency shutdown system in a targeted attack on one of its customers.

Researchers at Schneider also found a remote access Trojan (RAT) in the so-called TRITON/TRISIS malware that they say represents the first-ever RAT to infect safety-instrumented systems (SIS) equipment. Industrial sites such as oil and gas and water utilities typically run multiple SISes to independently monitor critical systems to ensure they are operating within acceptable safety thresholds, and when they are not, the SIS automatically shuts them down.

Schneider here today provided the first details of its investigation of the recently revealed TRITON/TRISIS attack that targeted a specific SIS used by one of its industrial customers. Two of the customer’s SIS controllers entered a failed safe mode that shut down the industrial process and ultimately led to the discovery of the malware.

Teams of researchers from Dragos and FireEye’s Mandiant last month each published their own analysis of the malware used in the attack, noting that the smoking gun – a payload that would execute a cyber-physical attack – had not been found.

Perhaps the most amusing part of the post is Schneider’s attribution of near super-human capabilities to the hackers:


Schneider’s controller is based on proprietary hardware that runs on a PowerPC processor. “We run our own proprietary operating system on top of that, and that OS is not known to the public. So the research required to pull this [attack] off was substantial,” including reverse-engineering it, Forney says. “This bears resemblance to a nation-state, someone who was highly financed.”

The attackers also had knowledge of Schneider’s proprietary protocol for Tricon, which also is undocumented publicly, and used it to create their own library for sending commands to interact with Tricon, he says.

Alternatives to a nation-state:

  • A 15-year-old working with junked Schneider hardware and the Schneider help desk
  • Disgruntled Schneider Electric employee or their children
  • Malware planted to force a quick and insecure patch being pushed out

I discount all the security chest beating by vendors. Their goal: continued use of their products.

Are your Schneider controllers air-gapped and audited?

Bludgeoning Bootloader Bugs:… (Rebecca “.bx” Shapiro – job hunting)

January 21st, 2018

Bludgeoning Bootloader Bugs: No write left behind by Rebecca “.bx” Shapiro.

Slides from ShmooCon 2018.

If you are new to bootloading, consider Shapiro’s two blog posts on the topic:

A History of Linux Kernel Module Signing

A Tour of Bootloading

both from 2015, and her resources page.

Aside from the slides, her most current work is found at: https://github.com/bx/bootloader_instrumentation_suite.

ShmooCon 2018 just finished earlier today, but check the ShmooCon archives for a video of Shapiro’s presentation.

I don’t normally post shout-outs for people seeking employment, but Shapiro does impressive work and she is sharing it with the broader community. Unlike some governments and corporations we could all name. Pass her name and details along.

Are You Smarter Than A 15 Year Old?

January 21st, 2018

15-Year-Old Schoolboy Posed as CIA Chief to Hack Highly Sensitive Information by Mohit Kumar.

From the post:

A notorious pro-Palestinian hacking group behind a series of embarrassing hacks against United States intelligence officials and leaked the personal details of 20,000 FBI agents, 9,000 Department of Homeland Security officers, and some number of DoJ staffers in 2015.

Believe or not, the leader of this hacking group was just 15-years-old when he used “social engineering” to impersonate CIA director and unauthorisedly access highly sensitive information from his Leicestershire home, revealed during a court hearing on Tuesday.

Kane Gamble, now 18-year-old, the British teenager hacker targeted then CIA director John Brennan, Director of National Intelligence James Clapper, Secretary of Homeland Security Jeh Johnson, FBI deputy director Mark Giuliano, as well as other senior FBI figures.

Between June 2015 and February 2016, Gamble posed as Brennan and tricked call centre and helpline staff into giving away broadband and cable passwords, using which the team also gained access to plans for intelligence operations in Afghanistan and Iran.

Gamble said he targeted the US government because he was “getting more and more annoyed about how corrupt and cold-blooded the US Government” was and “decided to do something about it.”

Your questions:

1. Are You Smarter Than A 15 Year Old?

2. Are You Annoyed by a Corrupt and Cold-blooded Government?

3. Have You Decided to do Something about It?

Yeses for #1 and #2 number in the hundreds of millions.

The lack of governments hemorrhaging data worldwide is silent proof that #3 is a very small number.

What’s your answer to #3? (Don’t post it in the comments.)

Collaborative Journalism Projects (Collaboration Opportunities for the Public?)

January 21st, 2018

Database: Search, sort and learn about collaborative journalism projects from around the world

From the post:

Over the past several months, the Center for Cooperative Media has been collecting, organizing and standardizing information about dozens and dozens of collaborative journalism projects around the world. Our goal was to build a database that could serve as a hub of information about collaborative journalism, something that would be useful to journalists, scholars, media executives, funders and others seeking information on the how such projects work, who’s doing them and what they’re covering.

We worked with Melody Kramer to build the first iteration of the database, which you can find below. It is a work in progress, and you’ll see that it’s still incomplete as we continue to add to it. So far for this soft launch, we’ve input information on 94 news collaborations between more than 800 organizations and 151 people.

But this is just the beginning. We need your help.

Is your project listed? If not, tell us about it. Is the information about your project incorrect? Let us know; email Melody at melodykramer@gmail.com. Are there fields missing you’d like to see us add, or other ways to sort that you think would be useful? Email the Center at info@centerforcooperativemedia.org. We’re using Airtable right now, but are still considering what the best way will be to display the treasure trove of data we’re collecting.

Some notes on navigating the database: First, it’s easier to see the whole picture on desktop than on mobile, although both work well. To see the full record for any particular project, click on the little blue arrow that appears to the left of the project name when you hover over it. You can sort by column as well.

Collaborative journalism is a great way to avoid duplication of effort and to find strength in numbers. This resource is a big step towards encouraging journalist-to-journalist collaboration.

Opportunities for members of the public to collaborate with journalists?

Suggestions?

What Can Reverse Engineering Do For You?

January 18th, 2018

From the description:

Reverse engineering is a core skill in the information security space, but it doesn’t necessarily get the widespread exposure that other skills do even though it can help you with your security challenges. We will talk about getting you quickly up and running with a reverse engineering starter pack and explore some interesting x86 assembly code patterns you may encounter in the wild. These patterns are essentially common malware evasion techniques that include packing, analysis evasion, shellcode execution, and crypto usages. It is not always easy recognizing when a technique is used. This talk will begin by defining each technique as a pattern and then the approaches for reading or bypassing the evasion.

Technical keynote at Shellcon 2017 by Amanda Rousseau (@malwareunicorn).

Even if you’re not interested in reverse engineering, watch the video to see a true master describing their craft.

The “patterns” she speaks of are what I would call “subject identity” in a topic maps context.
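
As a small taste of treating a technique as a recognizable pattern, here is a sketch of the entropy check analysts commonly use to flag packing, one of the evasion techniques Rousseau covers (the thresholds people use in practice vary; the data below is synthetic):

    import math
    import os
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Packed or encrypted sections approach 8 bits per byte; plain code and text sit well below.
    plain = b"MZ\x90\x00" + b"this program cannot be run in DOS mode" * 20
    packed_like = os.urandom(4096)          # stands in for a packed/encrypted section

    print(round(shannon_entropy(plain), 2))
    print(round(shannon_entropy(packed_like), 2))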

TLDR pages (man pages by example)

January 18th, 2018

TLDR pages

From the webpage:

The TLDR pages are a community effort to simplify the beloved man pages with practical examples.

The TLDR Pages Book (PDF) has 274 pages!

If you have ever hunted through a man page for an example, you will appreciate TLDR pages!

I first saw this in a tweet by Christophe Lalanne.

Launch of DECLASSIFIED

January 18th, 2018

Launch of DECLASSIFIED by Mark Curtis.

From the post:

I am about to publish on this site hundreds of UK declassified documents and articles on British foreign policy towards various countries. This will be the first time such a collection has been brought together online.

The declassified documents, mainly from the UK’s National Archives, reveal British policy-makers actual concerns and priorities from the 1940s until the present day, from the ‘horse’s mouth’, as it were: these files are often revelatory and provide an antidote to the often misleading and false mainstream media (and academic) coverage of Britain’s past and present foreign policies.

The documents include my collections of files, accumulated over many years and used as a basis for several books, on episodes such as the UK’s covert war in Yemen in the 1960s, the UK’s support for the Pinochet coup in Chile, the UK’s ‘constitutional coup’ in Guyana, the covert wars in Indonesia in the 1950s, the UK’s backing for wars against the Iraqi Kurds in the 1960s, the coup in Oman in 1970, support for the Idi Amin takeover in Uganda and many others policies since 1945.

But the collection also brings together many other declassified documents by listing dozens of media articles that have been written on the release of declassified files over the years. It also points to some US document releases from the US National Security Archive.

A new resource for those of you tracking the antics of the small and the silly through the 20th and into the 21st century.

I say the “small and the silly” because there’s no doubt that similar machinations have been part and parcel of the lives of government toadies for as long as there have been governments. Despite their exaggerated sense of their own importance and of the history-making importance of their efforts, almost none of their names survive in the ancient historical record.

With the progress of time, the same fate awaits the most recent and current crop of government familiars. While we wait for them to pass into obscurity, you can amuse yourself by outing them and tracking their activities.

This new archive may assist you in your efforts.

Be sure to keep topic maps in mind for mapping between disjoint vocabularies and collections of documents as well as accounts of events.

For Some Definition of “Read” and “Answer” – MS Clickbait

January 18th, 2018

Microsoft creates AI that can read a document and answer questions about it as well as a person by Allison Linn.

From the post:

It’s a major milestone in the push to have search engines such as Bing and intelligent assistants such as Cortana interact with people and provide information in more natural ways, much like people communicate with each other.

A team at Microsoft Research Asia reached the human parity milestone using the Stanford Question Answering Dataset, known among researchers as SQuAD. It’s a machine reading comprehension dataset that is made up of questions about a set of Wikipedia articles.

According to the SQuAD leaderboard, on Jan. 3, Microsoft submitted a model that reached the score of 82.650 on the exact match portion. The human performance on the same set of questions and answers is 82.304. On Jan. 5, researchers with the Chinese e-commerce company Alibaba submitted a score of 82.440, also about the same as a human.

With machine reading comprehension, researchers say computers also would be able to quickly parse through information found in books and documents and provide people with the information they need most in an easily understandable way.

That would let drivers more easily find the answer they need in a dense car manual, saving time and effort in tense or difficult situations.

These tools also could let doctors, lawyers and other experts more quickly get through the drudgery of things like reading through large documents for specific medical findings or rarified legal precedent. The technology would augment their work and leave them with more time to apply the knowledge to focus on treating patients or formulating legal opinions.

Wait, wait! If you read the details about SQuAD, you realize how far Microsoft (or anyone else) is from “…reading through large documents for specific medical findings or rarified legal precedent….”

What is the SQuAD test?

Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.

Not to take anything away from Microsoft Research Asia or the creators of SQuAD, but “…the answer to every question is a segment of text, or span, from the corresponding reading passage.” is a long way from synthesizing an answer from a long legal document.

The first hurdle is asking a question that can be scored against every “…segment of text, or span…” such that a relevant snippet of text can be found.

The second hurdle is the process of scoring snippets of text in order to retrieve the most useful one. That’s a mechanical process, not one that depends on the semantics of the underlying question or text.
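
To see just how mechanical the scoring is, here is a sketch along the lines of the official SQuAD evaluation: lowercase, strip punctuation and articles, then compare tokens. No semantics anywhere.

    import re
    from collections import Counter

    def normalize(text):
        text = re.sub(r"\b(a|an|the)\b", " ", text.lower())    # drop articles
        text = re.sub(r"[^\w\s]", " ", text)                    # drop punctuation
        return text.split()

    def exact_match(prediction, gold):
        return normalize(prediction) == normalize(gold)

    def f1(prediction, gold):
        pred, gold = normalize(prediction), normalize(gold)
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(gold)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("the Broncos", "Denver Broncos"))         # False
    print(round(f1("the Broncos", "Denver Broncos"), 2))        # 0.67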

There are other hurdles but those two suffice to show there is no “reading and answering questions” in the same sense we would apply to any human reader.

Click-bait headlines don’t serve the cause of advocating more AI research. On the contrary, a close reading of alleged progress leads to disappointment.

Tips for Entering the Penetration Testing Field

January 16th, 2018

Tips for Entering the Penetration Testing Field by Ed Skoudis.

From the post:

It’s an exciting time to be a professional penetration tester. As malicious computer attackers amp up the number and magnitude of their breaches, the information security industry needs an enormous amount of help in proactively finding and resolving vulnerabilities. Penetration testers who are able to identify flaws, understand them, and demonstrate their business impact through careful exploitation are an important piece of the defensive puzzle.

In the courses I teach on penetration testing, I’m frequently asked about how someone can land their first job in the field after they’ve acquired the appropriate technical skills and gained a good understanding of methodologies. Also, over the past decade, I’ve counseled a lot of my friends and acquaintances as they’ve moved into various penetration testing jobs. Although there are many different paths to pen test nirvana, let’s zoom into three of the most promising. It’s worth noting that these three paths aren’t mutually exclusive either. I know many people who started on the first path, jumped to the second mid-way, and later found themselves on path #3. Or, you can jumble them up in arbitrary order.

Career advice and a great listing of resources for any aspiring penetration “tester.”

If you do penetration work for a government, you may be a national hero. If you do commercial penetration testing, not a national hero but not on the run either. If you do non-sanctioned penetration work, life is uncertain. Same skill, same activity. Go figure.

Updated Hacking Challenge Site Links (Signatures as Subject Identifiers)

January 16th, 2018

Updated Hacking Challenge Site Links

From the post:

These are 70+ sites which offer free challenges for hackers to practice their skills. Some are web-based challenges, some require VPN access to private labs and some are downloadable ISOs and VMs. I’ve tested the links at the time of this posting and they work.

Most of them are at https://www.wechall.net but if I missed a few they will be there.

WeChall is a portal to hacking challenges where you can link your account to all the sites and get ranked. I’ve been a member since 2/2/14.

Internally to the site they have challenges there as well so make sure you check them out!

To find CTFs go to https://www.ctftime.org

On Twitter in the search field type CTF

Google is also your friend.

I’d rephrase “Google is also your friend.” to “Sometimes Google allows you to find ….”

When visiting hacker or CTF (capture the flag) sites, use the same levels of security as any government or other known hostile site.

What is an exploit or vulnerability signature if not a subject identifier?

Data Science Bowl 2018 – Spot Nuclei. Speed Cures.

January 16th, 2018

Spot Nuclei. Speed Cures.

From the webpage:

The 2018 Data Science Bowl offers our most ambitious mission yet: Create an algorithm to automate nucleus detection and unlock faster cures.

Compete on Kaggle

Three months. $100,000.

Even if you “lose,” think of the experience you will gain. No losers.

Enjoy!

PS: Just thinking out loud but if:


This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (brightfield vs. fluorescence). The dataset is designed to challenge an algorithm’s ability to generalize across these variations.

isn’t the ability to generalize, at the cost of lower performance, a downside?

Why not use the best algorithm for a specified set of data conditions, “merging” that algorithm so to speak, so that scientists always have the best algorithm for their specific data set?

So outside the contest, perhaps the imaging conditions are the most important subjects, and each image set should be matched to the algorithm that performs best under those conditions.
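
A sketch of that routing idea, with made-up condition labels and stand-in models (nothing here comes from the contest data; it only illustrates the dispatch):

    def brightfield_model(image):      # stand-in for the algorithm best on brightfield images
        return f"brightfield segmentation of {image}"

    def fluorescence_model(image):     # stand-in for the algorithm best on fluorescence images
        return f"fluorescence segmentation of {image}"

    def generalist_model(image):       # the "generalize across everything" fallback
        return f"generic segmentation of {image}"

    BEST_FOR = {
        ("brightfield", "20x"): brightfield_model,
        ("fluorescence", "40x"): fluorescence_model,
    }

    def segment(image, modality, magnification):
        model = BEST_FOR.get((modality, magnification), generalist_model)
        return model(image)

    print(segment("img_001.png", "fluorescence", "40x"))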

Anyone interested in collaborating on a topic map entry?

The Art & Science Factory

January 15th, 2018

The Art & Science Factory

From the about page:


The Art & Science Factory was started in 2008 by Dr. Brian Castellani to organize the various artistic, scientific and educational endeavours he and different collaborators have engaged in to address the growing complexity of global life.

Dr. Castellani is a complexity scientist/artist.

He is internationally recognized for his expertise in complexity science and its history and for his development of the SACS Toolkit, a case-based, mixed-methods, computationally-grounded framework for modeling complex systems. Dr. Castellani’s main area of study is applying complexity science and the SACS Toolkit to various topics in health and healthcare, including community health and medical education.

In terms of visual complexity, Castellani is recognized around the world for his creation of the complexity map, which can be found on Wikipedia and on this website. He is also recognized for his blog on “all things complexity science and art,” the Sociology and Complexity Science Blog.
… (emphasis in original)

Dr. Castellani apparently dislikes searchable text; the about page quote above was hand-transcribed from an image that makes up that page.

Unexpectedly, the SACS Toolkit and the other references were not hyperlinks, so here they are: SACS Toolkit, complexity map, and Sociology and Complexity Science Blog, respectively.

2018 Map of the Complexity Sciences

January 15th, 2018

2018 Map of the Complexity Sciences by Brian Castellani.

At full screen this map barely displays on my 22″ monitor so I’m not going to mangle it into something smaller for this post.

The reading instructions read in part:


Also, in order to present some type of organizational structure, the history of the complexity sciences is developed along the field’s five major intellectual traditions: dynamical systems theory (purple), systems science (blue), complex systems theory (yellow), cybernetics (gray) and artificial intelligence (orange). Again, the fit is not exact (and sometimes even somewhat forced); but it is sufficient to help those new to the field gain a sense of its evolving history.

The subject and person nodes are all hyperlinks to additional resources!

Enjoy!

Fun, Frustration, Curiosity, Murderous Rage – mimic

January 15th, 2018

mimic

From the webpage:


There are many more characters in the Unicode character set that look, to some extent or another, like others – homoglyphs. Mimic substitutes common ASCII characters for obscure homoglyphs.

Fun games to play with mimic:

  • Pipe some source code through and see if you can find all of the problems
  • Pipe someone else’s source code through without telling them
  • Be fired, and then killed

I can attest to the murderous rage from experience. There was a browser-based SGML parser that would barf on the presence of an extra whitespace (space I think) in the SGML declaration. One file worked, another with the “same” declaration did not.

Only by printing and comparing the files (this was on Windoze machines) was the errant space discovered.
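
Going the other direction from mimic, spotting lookalikes instead of planting them, takes only a few lines. The sample string below deliberately opens with a Cyrillic letter:

    import unicodedata

    text = "Вad coding style"   # first letter is CYRILLIC CAPITAL LETTER VE, not ASCII 'B'

    for ch in text:
        if ord(ch) > 127:
            print(f"suspicious U+{ord(ch):04X} {unicodedata.name(ch)}: {ch!r}")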

Enjoy!

Tactical Advantage: I don’t have to know everything, just more than you.

January 12th, 2018

Mapping the Ghostly Traces of Abandoned Railroads – An interactive, crowdsourced atlas plots vanished transit routes by Jessica Leigh Hester.

From the post:

In the 1830s, a rail line linked Elkton, Maryland, with New Castle, Delaware, shortening the time it took to shuttle people and goods between the Delaware River and Chesapeake Bay. Today you’d never know it had been there. A photograph snapped years after the line had been abandoned captures a stone culvert halfway to collapse into the creek it spanned. Another image, captured even later, shows a relict trail that looks more like a footpath than a railroad right-of-way. The compacted dirt seems wide enough to accommodate no more than two pairs of shoes at a time.

The scar of the New Castle and Frenchtown Railroad barely whispers of the railcars that once barreled through. That’s what earned it a place on Andrew Grigg’s map.

For the past two years, Grigg, a transit enthusiast, has been building an interactive atlas of abandoned railroads. Using Google Maps, he lays the ghostly silhouettes of the lines over modern aerial imagery. His recreation of the 16-mile New Castle and Frenchtown Line crosses state lines and modern highways, marches through suburban housing developments, and passes near a cineplex, a Walmart, and a paintball field.
… (emphasis in original)

Great example of a project capturing travel paths that may be omitted from modern maps. Being omitted from a map doesn’t impact the potential use of an abandoned railway as an alternative to other routes.

Be sure to check ahead of time but digital navigation systems may have omitted discontinued railroads.

The same advantage obtains if you know which underpasses flood after a heavy rain, which streets are impassable, when trains are passing over certain crossings, all manner of information that isn’t captured by standard digital navigation systems.

What information can you add to a map that isn’t known to or thought to be important by others?

Computus manuscripts and where to find them

January 12th, 2018

Computus manuscripts and where to find them

An interactive map of computus manuscripts by place of preservation.

A poor screen shot:

From the about page:

Welcome to the bèta version of Computus.lat, an online platform for teaching and research in studies of the medieval science of computus. Computus.lat consists of a catalogue of computistical manuscripts and computistical objects, a bibliography, and a number of resources (such as a Mirador-viewer and data visualizations).

Follow @computuslat on Twitter for updates.

Kind regards,
Thom Snijders

Over 500 manuscripts online!

Oh, Computus:

Computus in its simplest definition is the art of ascertaining time by the course of the sun and the moon. This art could be and was a theoretical science, such as that explored by Johannes of Sacrobosco in his De sphera–a science based on arithmetical calculations and astronomical measurements derived from use of the astrolabe or, increasingly by the end of the 13th century, the solar quadrant. In the context of the present exhibit, however, computus is understood mainly as the practical application of these calculations. To reckon time in the broadest sense and to determine the date of Easter became one and the same effort. And for most people, understanding the problem of correct alignment of solar, lunar, yearly and weekly cycles to arrive at the date of Easter was simply reduced to a question of “when?” rather than “why?”. The result was a profusion of calculation formulae, charts and memory devices.

Accompanying these handy mechanisms for determining the date of Easter were many other bits of calendrical information that faith, prejudice and experience leveled to the same degree of acceptance and necessity: the lucky and the unlucky days for travel or for eating goose; the prognostications of rain or wind; the times for bloodletting; the signs of the zodiac; the phases of the moon; the number of hours of sunshine in a given day; the feasts of the saints; the Sundays in a perpetual calendar.

Take heed of the line: “The result was a profusion of calculation formulae, charts and memory devices.” (emphasis added)
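
The medieval tables compress into a few lines of arithmetic today. Here is the anonymous Gregorian (Meeus/Jones/Butcher) algorithm for the Western Easter date, offered as a sketch for the curious rather than a substitute for the manuscripts:

    def easter(year):
        """Gregorian Easter via the anonymous Gregorian (Meeus/Jones/Butcher) algorithm."""
        a = year % 19
        b, c = divmod(year, 100)
        d, e = divmod(b, 4)
        f = (b + 8) // 25
        g = (b - f + 1) // 3
        h = (19 * a + b - d - g + 15) % 30
        i, k = divmod(c, 4)
        l = (32 + 2 * e + 2 * i - h - k) % 7
        m = (a + 11 * h + 22 * l) // 451
        month, day = divmod(h + l - 7 * m + 114, 31)
        return year, month, day + 1

    print(easter(2018))   # (2018, 4, 1) -- Easter fell on April 1st in 2018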

And you think we have trouble with daylight savings time and time zones. 😉

Pass this along to manuscript scholars, liturgy buffs, historians, and anyone interested in our diverse religious history.

A [Selective] Field Guide to “Fake News” and other Information Disorders

January 12th, 2018

New guide helps journalists, researchers investigate misinformation, memes and trolling by Liliana Bounegru and Jonathan Gray.

Recent scandals about the role of social media in key political events in the US, UK and other European countries over the past couple of years have underscored the need to understand the interactions between digital platforms, misleading information and propaganda, and their influence on collective life in democracies.

In response to this, the Public Data Lab and First Draft collaborated last year to develop a free, open-access guide to help students, journalists and researchers investigate misleading and viral content, memes and trolling practices online.

Released today, the five chapters of the guide describe a series of research protocols or “recipes” that can be used to trace trolling practices, the ways false viral news and memes circulate online, and the commercial underpinnings of problematic content. Each recipe provides an accessible overview of the key steps, methods, techniques and datasets used.

The guide will be most useful to digitally savvy and social media literate students, journalists and researchers. However, the recipes range from easy formulae that can be executed without much technical knowledge other than a working understanding of tools such as BuzzSumo and the CrowdTangle browser extension, to ones that draw on more advanced computational techniques. Where possible, we try to offer the recipes in both variants.

Download the guide at the Public Data Lab’s website.

The techniques in the guide are fascinating but the underlying definition of “fake news” is problematic:


The guide explores the notion that fake news is not just another type of content that circulates online, but that it is precisely the character of this online circulation and reception that makes something into fake news. In this sense fake news may be considered not just in terms of the form or content of the message, but also in terms of the mediating infrastructures, platforms and participatory cultures which facilitate its circulation. In this sense, the significance of fake news cannot be fully understood apart from its circulation online. It is the register of this circulation that also enables us to trace how material that starts its life as niche satire can be repackaged as hyper-partisan clickbait to generate advertising money and then continue life as an illustration of dangerous political misinformation.

As a consequence this field guide encourages a shift from focusing on the formal content of fabrications in isolation to understanding the contexts in which they circulate online. This shift points to the limits of a “deficit model” approach – which might imply that fabrications thrive only because of a deficit of factual information. In the guide we suggest new ways of mapping and responding to fake news beyond identifying and fact-checking suspect claims – including “thicker” accounts of circulation as a way to develop a richer understanding of how fake news moves and mobilises people, more nuanced accounts of “fakeness” and responses which are better attuned to the phenomenon.
… (page 8)

The means by which information circulates is always relevant to the study of communications. However, notice that the authors’ definition excludes traditional media from its quest to identify “fake news.” Really? Traditional media isn’t responsible for the circulation of any “fake news?”

Examples of traditional media fails are legion but here is a recent and spectacular one: The U.S. Media Suffered Its Most Humiliating Debacle in Ages and Now Refuses All Transparency Over What Happened by Glenn Greenwald.

Friday was one of the most embarrassing days for the U.S. media in quite a long time. The humiliation orgy was kicked off by CNN, with MSNBC and CBS close behind, and countless pundits, commentators, and operatives joining the party throughout the day. By the end of the day, it was clear that several of the nation’s largest and most influential news outlets had spread an explosive but completely false news story to millions of people, while refusing to provide any explanation of how it happened.

The spectacle began Friday morning at 11 a.m. EST, when the Most Trusted Name in News™ spent 12 straight minutes on air flamboyantly hyping an exclusive bombshell report that seemed to prove that WikiLeaks, last September, had secretly offered the Trump campaign, even Donald Trump himself, special access to the Democratic National Committee emails before they were published on the internet. As CNN sees the world, this would prove collusion between the Trump family and WikiLeaks and, more importantly, between Trump and Russia, since the U.S. intelligence community regards WikiLeaks as an “arm of Russian intelligence,” and therefore, so does the U.S. media.

This entire revelation was based on an email that CNN strongly implied it had exclusively obtained and had in its possession. The email was sent by someone named “Michael J. Erickson” — someone nobody had heard of previously and whom CNN could not identify — to Donald Trump Jr., offering a decryption key and access to DNC emails that WikiLeaks had “uploaded.” The email was a smoking gun, in CNN’s extremely excited mind, because it was dated September 4 — 10 days before WikiLeaks began promoting access to those emails online — and thus proved that the Trump family was being offered special, unique access to the DNC archive: likely by WikiLeaks and the Kremlin.

There was just one small problem with this story: It was fundamentally false, in the most embarrassing way possible. Hours after CNN broadcast its story — and then hyped it over and over and over — the Washington Post reported that CNN got the key fact of the story wrong.

This fundamentally false story does not qualify as “fake news” for this guide. Surprised?

The criteria for “fake news” also exclude questioning statements from members of the intelligence community, which includes James Clapper, a self-confessed and known liar, who continues to be the darling of mainstream media outlets.

Cozy relationships between news organizations and their reporters with government and intelligence sources are also not addressed as potential sources of “fake news.”

Limiting the scope of a “fake news” study in order to have a doable project is understandable. However, excluding factually false stories, use of known liars and corrupting relationships, all because they occur in mainstream media, looks like picking a target to tar with the label “fake news.”

The guides and techniques themselves may be quite useful, so long as you remember they were designed to show social media as the spreader of “fake news.”

One last thing: what the authors don’t offer, and I haven’t seen reports of, is the effectiveness of so-called “fake news” with voters. Take “Pope Francis Endorses Trump” as a lie: however widely that story spread, did it have any impact on the 2016 election? Or did every reader do a double-take and move on? It’s possible to answer that type of question, but it does require facts.

Getting Started with Python/CLTK for Historical Languages

January 12th, 2018

Getting Started with Python/CLTK for Historical Languages by Patrick J. Burns.

From the post:

This is an ongoing project to collect online resources for anybody looking to get started with working with Python for historical languages, esp. using the Classical Language Toolkit. If you have suggestions for this list, email me at patrick[at]diyclassics[dot]org.
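
If you want to kick the tires before working through Burns’ list, a minimal Latin example against the CLTK API as it stood in early 2018 (the 0.1.x series) looks roughly like this; the library’s interface has changed across versions, so treat it as a sketch and check the current documentation:

    # Sketch assuming CLTK 0.1.x-era modules; newer releases reorganized these imports.
    from cltk.corpus.utils.importer import CorpusImporter
    from cltk.stem.lemma import LemmaReplacer

    CorpusImporter('latin').import_corpus('latin_models_cltk')   # one-time model download

    lemmatizer = LemmaReplacer('latin')
    print(lemmatizer.lemmatize('arma virumque cano troiae qui primus ab oris'.split()))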

What classic or historical language resources would you recommend?

Complete Guide to Topic Modeling (Recommender System for Email Dumps?)

January 12th, 2018

Complete Guide to Topic Modeling with scikit-learn and gensim by George-Bogdan Ivanov.

From the post:

Why is Topic Modeling useful?

There are several scenarios when topic modeling can prove useful. Here are some of them:

  • Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
  • Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.
  • Uncovering Themes in Texts – Useful for detecting trends in online publications for example

Would a recommender system be useful for reading email dumps? 😉

Within or across candidates for Congress?
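
As a rough sketch of that idea, assuming scikit-learn and an in-memory list of email bodies (the tiny corpus below is a stand-in for however you load the dump):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.metrics.pairwise import cosine_similarity

    emails = [                     # stand-in corpus; substitute the real dump here
        "schedule the fundraiser and send the donor list",
        "donor list attached, fundraiser venue confirmed",
        "debate prep notes and talking points for tonight",
        "talking points for the debate are attached",
        "server maintenance window this weekend",
    ]

    dtm = CountVectorizer(max_df=0.95, min_df=2, stop_words="english").fit_transform(emails)
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    topic_vectors = lda.fit_transform(dtm)    # one topic mixture per email

    # "More like this": rank every email by topic similarity to the one you just read.
    sims = cosine_similarity(topic_vectors[0:1], topic_vectors)[0]
    for idx in sims.argsort()[::-1][1:4]:
        print(idx, round(sims[idx], 3))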