WordPress rules, but I would like my content to be on other platforms as well. Some platforms, like DEV, use Markdown, and I struggle to import my articles there. That's why I created a small application to convert an article to Markdown.
Packages
This solution uses Node.js. NPM has some great packages to work with:
- node-fetch - to download the HTML. Depending on your version of Node.js, you might not need it, as Node.js 18 and later ship a built-in fetch. I use version 2, as I don't use ESM (version 3 is ESM-only).
- linkedom - to parse HTML into a workable DOM. I used to use jsdom, but I switched for performance reasons.
- node-html-markdown - to parse HTML into markdown.
Install them like:
npm install node-fetch@2 linkedom node-html-markdown
npm install -D @types/node-fetch
Simple scraper
We're going to do the following:
- Fetch the text of the URL. This is HTML, of course.
- Parse it to DOM nodes.
- Detect the article node.
- Convert the article node to Markdown.
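Step 3 is the only part that depends on the markup of the site: the idea is to try a list of selectors from most to least specific and take the first hit. Here is a minimal sketch of that idea, written against a tiny interface so it runs without a DOM library. The names `Queryable`, `firstMatch` and `ARTICLE_SELECTORS` are made up for this illustration and are not part of linkedom.

```typescript
// Minimal shape shared by Document and Element; declared here so the sketch
// runs without a DOM library.
interface Queryable<T> {
  querySelector(selector: string): T | null
}

// Return the first node matched by any selector, trying them in order.
function firstMatch<T>(root: Queryable<T>, selectors: string[]): T | null {
  for (const selector of selectors) {
    const node = root.querySelector(selector)
    if (node) return node
  }
  return null
}

// Most specific first; these are the selectors used in this article.
const ARTICLE_SELECTORS = [
  "article .entry-content",
  "article .crayons-article__main",
  "article",
  "body",
]
```

With a real document, `firstMatch(document, ARTICLE_SELECTORS)` should return the same node as a chain of `querySelector` calls joined with `||`, and supporting another theme only means adding a selector to the list.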
This results in the following lines of code:
import fetch from "node-fetch"
import { parseHTML } from "linkedom"
import { NodeHtmlMarkdown } from "node-html-markdown"

async function scrape(url: string) {
  // download the page as text
  const response = await fetch(url)
  const txt = await response.text()
  const { document } = parseHTML(txt)
  // custom parsing:
  // parseCodeFields(document)
  // parseEmbeds(document)
  // pick the most specific article container that exists
  const article = (
    document.querySelector('article .entry-content') ||
    document.querySelector('article .crayons-article__main') ||
    document.querySelector('article') ||
    document.querySelector('body'))
  const html = article?.innerHTML || ""
  let content = NodeHtmlMarkdown.translate(html).trim()
  // let header = parseHeader(document)
  // content = header + content
  return content
}
Code Language Support
My WordPress generates <pre class="lang-ts"><code></code></pre> blocks, but it looks like node-html-markdown only picks up the language from <pre><code class="language-ts"></code></pre>. That's easily fixed by adding some extra processing before converting the document to Markdown:
function parseCodeFields(document: Document) {
  document.querySelectorAll("pre code").forEach(code => {
    // WordPress puts the language on the parent <pre>, e.g. class="lang-ts"
    let lang = [...code.parentElement?.classList || []]
      .find(x => x.startsWith("lang-"))
    if (!lang) return
    // copy it to the <code> element as "language-ts"
    lang = lang.replace("lang-", "language-")
    code.classList.add(lang)
  })
}
Embed rich content
Fortunately, dev.to supports liquid tags to embed rich content like repl.it and tweets. Let's parse our iframe elements into a liquid tag:
function parseEmbeds(document: Document) {
  document.querySelectorAll('iframe').forEach(iframe => {
    if (!iframe.src) return
    const url = new URL(iframe.src)
    // the host becomes the tag name, the path becomes its argument
    const type = url.host
    const name = url.pathname
    const p = document.createElement("p")
    const n = document.createTextNode(`{% ${type} ${name} %}`)
    p.appendChild(n)
    // place the liquid tag in a paragraph just before the iframe
    iframe.parentNode?.insertBefore(p, iframe)
  })
}
This will not work for every embed, but it will get you started.
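One way to cover more embeds is a small table of per-host rules that turn an iframe URL into a liquid tag. The sketch below is an illustration only: `EMBED_RULES` and `toLiquidTag` are hypothetical names, and the `youtube` and `codepen` tag formats are assumptions that should be checked against the DEV editor guide before you rely on them.

```typescript
// Hypothetical host-to-tag rules; the tag formats are assumptions to verify
// against the DEV editor guide.
const EMBED_RULES: Record<string, (url: URL) => string | null> = {
  "www.youtube.com": url => {
    // embed URLs look like /embed/<videoId>
    const match = url.pathname.match(/^\/embed\/([^/]+)/)
    return match ? `{% youtube ${match[1]} %}` : null
  },
  "codepen.io": url => {
    // turn /<user>/embed/<id> into the public /<user>/pen/<id> URL
    const pen = url.origin + url.pathname.replace("/embed/", "/pen/")
    return `{% codepen ${pen} %}`
  },
}

// Return a liquid tag for a known embed host, or null when no rule applies.
function toLiquidTag(src: string): string | null {
  let url: URL
  try {
    url = new URL(src)
  } catch {
    return null // relative or malformed src
  }
  const rule = EMBED_RULES[url.host]
  return rule ? rule(url) : null
}
```

A `null` result means no rule matched, so the caller can fall back to the generic host and path tag from `parseEmbeds` or leave the iframe alone.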
Header support
To be complete, we also need to add a YAML header with the title, tags and the canonical URL. It requires some parsing, but it'll make things easier:
function parseHeader(document: Document) {
  let header = '---\n'
  const title = (document.querySelector('h1')?.textContent || '').trim()
  if (title) {
    // quote the title so characters like ":" don't break the YAML
    header += `title: ${JSON.stringify(title)}\n`
  }
  // collect category and tag links as lowercase names
  const tags = [...document.querySelectorAll(".categories a, .tags a")]
    .map(a => (a.textContent || '').trim().toLowerCase())
    .filter(t => t)
  if (tags.length > 0) {
    tags.sort()
    // de-duplicate; the Set keeps the sorted order
    const t = [...new Set(tags)].join(", ")
    header += `tags: [${t}]\n`
  }
  const canonical = document.querySelector('link[rel=canonical]')?.getAttribute("href")
  if (canonical) {
    header += `canonical_url: ${canonical}\n`
  }
  header += '---\n\n'
  return header
}
Final thoughts
I still need to find a better way to detect the language of code snippets, so I don't have to add them by hand. When I look at the result, I know one thing for sure: I'll keep using WordPress to write my blogs, as Markdown does not make it more readable!
Oh, and when you read this post on dev.to: it was created using this code (and yes, that's super meta 🤓).
Changelog
- The repl.it program no longer works when embedded in a frame, so I have removed it.
- Added title support.
- Added YAML header support (see Header support).
- Fixed language support for WordPress code fields (see Code Language Support).
- Initial article.