Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 17, 2018

Xidel – HTML/XML/JSON data extraction tool

Filed under: Web Scraping,XQuery — Patrick Durusau @ 7:12 pm

Xidel – HTML/XML/JSON data extraction tool

From the webpage:


Features

It supports:

  • Extract expressions:
    • CSS 3 Selectors: to extract simple elements
    • XPath 3.0: to extract values and calculate things with them
    • XQuery 3.0: to create new documents from the extracted values
    • JSONiq: to work with JSON APIs
    • Templates: to extract several expressions in an easy way using an annotated version of the page for pattern-matching
    • XPath 2.0/XQuery 1.0: compatibility mode for the old XPath/XQuery version
  • Following:
    • HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies
    • Links: It can follow all links on a page as well as some extracted values
    • Forms: It can fill in arbitrary data and submit the form
  • Output formats:
    • Adhoc: just prints the data in a human readable format
    • XML: encodes the data as XML
    • HTML: encodes the data as HTML
    • JSON: encodes the data as JSON
    • bash/cmd: exports the data as shell variables
  • Connections: HTTP / HTTPS as well as local files or stdin
  • Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (synapse)
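To make the list concrete, here is a minimal Python sketch that assembles an Xidel command line for one extract expression and one output format. It is not from the Xidel webpage: the helper names (xidel_command, run_xidel) and the input file page.html are hypothetical, and the flag spellings (--xpath, --css, --xquery, --output-format, and the json-wrapped value) are recalled from Xidel's command-line help, so check them against `xidel --help` for your version.

```python
import shutil
import subprocess

# Assumed Xidel flag names; verify with `xidel --help` before relying on them.
EXTRACT_FLAGS = {"xpath": "--xpath", "css": "--css", "xquery": "--xquery"}


def xidel_command(source, expression, kind="xpath", output_format="adhoc"):
    """Build the argv list for a single Xidel extraction (hypothetical helper)."""
    if kind not in EXTRACT_FLAGS:
        raise ValueError(f"unsupported extraction kind: {kind!r}")
    return [
        "xidel",
        source,  # URL, local file, or "-" for stdin
        f"{EXTRACT_FLAGS[kind]}={expression}",
        f"--output-format={output_format}",
    ]


def run_xidel(argv):
    """Run the command if xidel is installed; return its stdout, else None."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # page.html is a placeholder input, not a file from the original post.
    cmd = xidel_command("page.html", "//title", output_format="json-wrapped")
    print(" ".join(cmd))
    output = run_xidel(cmd)
    print(output if output is not None else "xidel not found on PATH")
```

Passing the arguments as a list, rather than one shell string, keeps XPath quoting out of the shell's hands, which matters once expressions contain quotes or brackets.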

Xidel is a very good excuse to practice your XML (XPath/XQuery) on a daily basis!

Not to mention that Xidel expressions offer a portable way to share web scraping scripts for websites.

Enjoy!
