Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 17, 2018

Xidel – HTML/XML/JSON data extraction tool

Filed under: Web Scraping,XQuery — Patrick Durusau @ 7:12 pm

Xidel – HTML/XML/JSON data extraction tool

From the webpage:


Features

It supports:

  • Extract expressions:
    • CSS 3 Selectors: to extract simple elements
    • XPath 3.0: to extract values and calculate things with them
    • XQuery 3.0: to create new documents from the extracted values
    • JSONiq: to work with JSON apis
    • Templates: to extract several expressions in an easy way using an annotated version of the page for pattern-matching
    • XPath 2.0/XQuery 1.0: compatibility mode for the old XPath/XQuery version
  • Following:
    • HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies
    • Links: It can follow all links on a page as well as some extracted values
    • Forms: It can fill in arbitrary data and submit the form
  • Output formats:
    • Adhoc: just prints the data in a human readable format
    • XML: encodes the data as XML
    • HTML: encodes the data as HTML
    • JSON: encodes the data as JSON
    • bash/cmd: exports the data as shell variables
  • Connections: HTTP / HTTPS as well as local files or stdin
  • Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (synapse)
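
To make the extract-then-encode idea in the list above concrete, here is a minimal sketch in Python. It is not Xidel and does not use Xidel's syntax: it relies only on the Python standard library, whose ElementTree module supports a small XPath subset, and the sample page, the function names `extract_links` and `to_json`, and the example.org URLs are all hypothetical, made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample page; it is well-formed XHTML so the stdlib XML parser accepts it.
SAMPLE_PAGE = """<html>
  <body>
    <h1>Sample links</h1>
    <ul>
      <li><a href="https://example.org/a">First</a></li>
      <li><a href="https://example.org/b">Second</a></li>
    </ul>
  </body>
</html>"""


def extract_links(markup):
    """Return the text and href of every <a> element that has an href attribute."""
    root = ET.fromstring(markup)
    # ElementTree implements only a limited XPath subset; ".//a[@href]" is within it.
    return [
        {"text": (a.text or "").strip(), "href": a.get("href")}
        for a in root.findall(".//a[@href]")
    ]


def to_json(records):
    """Encode the extracted records as JSON, one of the output formats listed above."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(to_json(extract_links(SAMPLE_PAGE)))
```

Real-world HTML is rarely well-formed XML, which is one reason a dedicated tool with an HTML parser and full XPath/XQuery support is attractive for this kind of work.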

Xidel is a very good excuse to practice your XML (XPath/XQuery) on a daily basis!

Not to mention that it offers a portable, shareable format for web scraping scripts.

Enjoy!
