Article to Markdown (with some headers)

A simple script that uses postlight/mercury-parser to convert an article into a neat Markdown file with a short metadata header (source, author, word count, extraction time). #python

Example output

# Why I’m done with Chrome

* **Source:** [blog.cryptographyengineering.com](https://blog.cryptographyengineering.com/2018/09/23/why-im-leaving-chrome/)
* **Author:** Matthew Green
* **Word count:** 2234
* **Extracted at:** 2019-11-15 23:38

![lead image](https://matthewdgreen.files.wordpress.com/2018/09/untitled-3.png)

This blog is mainly reserved for cryptography, and I try to avoid filling it with random ![512px-Google_Chrome_icon_(September_2014).svg](https://matthewdgreen.files.wordpress.com/2018/09/512px-google_chrome_icon_september_2014-svg.png?w=512)“someone is wrong on the Internet” posts. After all, that’s what Twitter is for! But from time to time something bothers me enough that I have to make an exception. Today I wanted to write specifically about Google Chrome, how much I’ve loved it in the past, and why — due to Chrome’s new user-unfriendly [forced login policy](https://news.ycombinator.com/item?id=17942252) — I won’t be using it going forward.

...

Script

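import datetime
import json
import os
import subprocess
import sys

link = sys.argv[1]
print("Processing " + link)
# Passes the arguments as a list so characters such as "&" or "?" in the URL
# are never interpreted by a shell
result = subprocess.run(
    ["mercury-parser", link, "--format=markdown"],
    capture_output=True, text=True, check=True,
)
resp = json.loads(result.stdout)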
today = datetime.datetime.now()
out_content = resp["content"]
out_title   = resp["title"]
out_url     = resp["url"]
out_domain  = resp["domain"]
out_wc      = resp["word_count"]
# Falls back to "Unknown" when the parser finds no author
out_author = resp["author"] or "Unknown"
# Ewww, www (only a leading "www." is stripped)
if out_domain.startswith("www."):
    out_domain = out_domain[len("www."):]
# The metadata header is the same with or without a lead image
extracted_at = today.strftime("%Y-%m-%d %H:%M")
header = (
    f"* **Source:** [{out_domain}]({out_url})\n"
    f"* **Author:** {out_author}\n"
    f"* **Word count:** {out_wc}\n"
    f"* **Extracted at:** {extracted_at}\n\n"
)
if resp["lead_image_url"]:
    # Shows the lead image right under the header
    out_lead_img = resp["lead_image_url"]
    content = f"# {out_title}\n\n{header}![lead image]({out_lead_img})\n\n{out_content}"
else:
    # Without an image, a horizontal rule separates the header from the body
    content = f"# {out_title}\n\n{header}---\n\n{out_content}"
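# Formats the title of the file. The date is formatted on its own so that a
# "%" in the article title is not read as a strftime directive.
title = today.strftime("%Y%m%d") + "-" + out_title.lower()
# Replaces regular and non-breaking spaces with hyphens
for ch in [" ", "\u00a0"]:
    title = title.replace(ch, "-")
# Drops punctuation, plus "/" which would otherwise act as a path separator
for ch in ["'", ",", "’", "/"]:
    title = title.replace(ch, "")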
# Writes to the actual file (the ../links directory must already exist)
with open(os.path.join(os.pardir, "links", title + ".md"), "w", encoding="utf-8") as f:
    f.write(content)
print("Done!")
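
The filename step in the script removes a fixed list of characters. A more general variant, offered only as a sketch and not as the script's own method, could normalise the title with regular expressions. The function name `slugify_title` is hypothetical, and the sketch assumes that ASCII-only filenames are acceptable, so accented letters are dropped.

```python
import datetime
import re


def slugify_title(title, when):
    """Build a filename stem such as 20191115-why-im-done-with-chrome."""
    # Drop apostrophes first so "I’m" becomes "im" rather than "i-m"
    cleaned = re.sub(r"['’]", "", title.lower())
    # Collapse every run of characters other than a-z and 0-9 into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")
    # Format the date separately so the title never reaches strftime
    return when.strftime("%Y%m%d") + "-" + slug


print(slugify_title("Why I’m done with Chrome", datetime.datetime(2019, 11, 15, 23, 38)))
# prints: 20191115-why-im-done-with-chrome
```

Because every non-alphanumeric run becomes a single hyphen, characters such as "/", "%" or ":" cannot end up in the path.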

How to run it

# install the parser globally so its command-line tool is on your PATH
yarn global add @postlight/parser
# or
npm install -g @postlight/parser

python3 script.py https://example.com
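
The script depends on the `mercury-parser` command being reachable from the shell. A small pre-flight check could make a missing install easier to spot. This is a minimal sketch using the standard library's `shutil.which`; the helper name `require_cli` and the wording of the message are assumptions, not part of the original script.

```python
import shutil
import sys


def require_cli(name):
    """Return the full path of an executable, or exit with a readable message."""
    path = shutil.which(name)
    if path is None:
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit(name + " was not found on PATH; install it globally first")
    return path
```

Calling `require_cli("mercury-parser")` before the parser is invoked would stop the script early with a clear message.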

Unless specified otherwise, this work is licensed under a Creative Commons BY-NC-SA 4.0 license.