Article to Markdown (with some headers)
A simple script that uses postlight/mercury-parser to convert an article into a neat Markdown file with some headers. #python
Example output
# Why I’m done with Chrome
* **Source:** [blog.cryptographyengineering.com](https://blog.cryptographyengineering.com/2018/09/23/why-im-leaving-chrome/)
* **Author:** Matthew Green
* **Word count:** 2234
* **Extracted at:** 2019-11-15 23:38

This blog is mainly reserved for cryptography, and I try to avoid filling it with random “someone is wrong on the Internet” posts. After all, that’s what Twitter is for! But from time to time something bothers me enough that I have to make an exception. Today I wanted to write specifically about Google Chrome, how much I’ve loved it in the past, and why — due to Chrome’s new user-unfriendly [forced login policy](https://news.ycombinator.com/item?id=17942252) — I won’t be using it going forward.
...
Script
import datetime
import json
import os
import subprocess
import sys

link = sys.argv[1]
print("Processing " + link)

# Pass the URL as its own argument (no shell), so characters like "&" or "?" are safe
result = subprocess.run(
    ["mercury-parser", link, "--format=markdown"],
    capture_output=True, text=True, check=True,
)
resp = json.loads(result.stdout)
today = datetime.datetime.now()

out_content = resp["content"]
out_title = resp["title"]
out_url = resp["url"]
out_domain = resp["domain"]
out_wc = resp["word_count"]
out_author = resp.get("author") or "Unknown"

# Ewww, www
if out_domain.startswith("www."):
    out_domain = out_domain[len("www."):]

header = (
    f"* **Source:** [{out_domain}]({out_url})\n"
    f"* **Author:** {out_author}\n"
    f"* **Word count:** {out_wc}\n"
    f"* **Extracted at:** {today.strftime('%Y-%m-%d %H:%M')}\n\n"
)

if resp.get("lead_image_url"):
    # Articles with a lead image get no horizontal rule after the header
    content = "# " + out_title + "\n\n" + header + "\n\n" + out_content
else:
    # Otherwise a horizontal rule separates the header from the body
    content = "# " + out_title + "\n\n" + header + "---\n\n" + out_content

# Formats the title of the file (the date is formatted separately so a "%" in the title is left alone)
title = (today.strftime("%Y%m%d") + "-" + out_title).lower()
title = title.replace(" ", "-")
for ch in ["'", ",", "’"]:
    title = title.replace(ch, "")

# Writes to the actual file (the ../links directory must already exist)
with open(os.path.join(os.pardir, "links", title + ".md"), "w", encoding="utf-8") as f:
    f.write(content)
print("Done!")
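The filename formatting at the end of the script only handles spaces and a few punctuation marks, so a title containing a slash, colon or question mark could still produce an awkward or invalid path. Below is a minimal sketch of a stricter alternative. The helper name `slugify_title` is hypothetical and not part of the original script, and dropping every non-ASCII character is an assumption that may be too aggressive for non-English titles.

```python
import re
import unicodedata
from datetime import datetime


def slugify_title(title: str, when: datetime) -> str:
    """Return a filename stem such as '20191115-why-im-done-with-chrome'.

    Hypothetical helper, not part of the original script.
    """
    # Decompose accented characters, then drop anything that is not ASCII
    # (this also removes typographic apostrophes such as U+2019).
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove plain apostrophes so "I'm" becomes "im" rather than "i-m".
    ascii_title = ascii_title.replace("'", "")
    # Collapse every run of remaining non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return when.strftime("%Y%m%d") + "-" + slug
```

For the example article above, `slugify_title("Why I’m done with Chrome", datetime(2019, 11, 15, 23, 38))` returns `20191115-why-im-done-with-chrome`.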
How to run it
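# The script calls the `mercury-parser` command, so the CLI must be installed globally (on your PATH)
yarn global add @postlight/mercury-parser
# or
npm install -g @postlight/mercury-parser
# Note: the renamed @postlight/parser package installs the command as `postlight-parser`; adjust the script if you use that one
python3 script.py https://example.com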
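If the parser command is not on your PATH, the script fails when it tries to run it, and the resulting error can be hard to read. Below is a minimal sketch of a pre-flight check that could run before the script does anything else. The function name `require_parser` is hypothetical, and the default command name `mercury-parser` is taken from the script above.

```python
import shutil


def require_parser(command: str = "mercury-parser") -> str:
    """Return the full path of the parser CLI, or raise if it is not on PATH.

    Hypothetical helper, not part of the original script.
    """
    path = shutil.which(command)
    if path is None:
        raise RuntimeError(
            command + " was not found on PATH; install it globally first"
        )
    return path
```

Calling `require_parser()` at the top of the script turns a missing install into a single clear message instead of a failure partway through.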
Unless specified otherwise, this work is licensed under a Creative Commons BY-NC-SA 4.0 license.