metascraper

August 11, 2019
## Preface

Nowadays everything is about speed, and scrapers are no exception. That's how I found this package, metascraper: it ships with built-in rules that automatically scan the HTML DOM and extract metadata accordingly.

## Installation

As usual with npm, it's simple:

```shell
$ npm install metascraper --save
```

## Basic usage

```javascript
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const { body: html, url } = await got(targetUrl)
  const metadata = await metascraper({ html, url })
  console.log(metadata)
})()
```

## Output

```json
{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
```

## How it works

Let's take a look at how it works under the hood.

## The core

The core is actually quite simple: it's just function currying (see here). The first call takes the rule functions as arguments, does some error checking, and merges the rules.
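To make the currying idea concrete, here is a minimal, hypothetical sketch of the pattern: the first call captures the rules, and the returned function receives the `{ url, html }` input later. The names (`createScraper`, the `host` rule) are made up for illustration and are not part of metascraper's API.

```javascript
// First call: capture the rule functions.
const createScraper = rules => {
  // Second call: apply every rule to the input and collect the results.
  return ({ url, html }) => {
    const result = {}
    for (const [key, rule] of Object.entries(rules)) {
      result[key] = rule({ url, html })
    }
    return result
  }
}

// Configure once, reuse for many pages.
const scrape = createScraper({
  host: ({ url }) => new URL(url).host
})
```

The payoff of this shape is that configuration (which rules to run) is separated from invocation (which page to scrape), so one configured scraper can be reused across many URLs.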

```javascript
'use strict'

const { isUrl } = require('@metascraper/helpers')
const whoops = require('whoops')

const mergeRules = require('./merge-rules')
const loadRules = require('./load-rules')
const loadHTML = require('./load-html')
const getData = require('./get-data')

const MetascraperError = whoops('MetascraperError')

module.exports = rules => {
  const loadedRules = loadRules(rules)
  return async ({ url, html, rules: inlineRules, escape = true } = {}) => {
    if (!isUrl(url)) {
      throw new MetascraperError({
        message: 'Need to provide a valid URL.',
        code: 'INVALID_URL'
      })
    }
    return getData({
      url,
      escape,
      htmlDom: loadHTML(html),
      rules: mergeRules(inlineRules, loadedRules)
    })
  }
}
```

## Default rules

```javascript
require('metascraper-author')(),
require('metascraper-date')(),
require('metascraper-description')(),
require('metascraper-image')(),
require('metascraper-logo')(),
require('metascraper-clearbit')(),
require('metascraper-publisher')(),
require('metascraper-title')(),
require('metascraper-url')()
```

Each of the rules above is a curried function (see here). For example, metascraper-title looks like this:

```javascript
'use strict'

const { $filter, title } = require('@metascraper/helpers')

const wrap = rule => ({ htmlDom }) => {
  const value = rule(htmlDom)
  return title(value)
}

module.exports = () => ({
  title: [
    wrap($ => $('meta[property="og:title"]').attr('content')),
    wrap($ => $('meta[name="twitter:title"]').attr('content')),
    wrap($ => $('.post-title').text()),
    wrap($ => $('.entry-title').text()),
    wrap($ => $('h1[class*="title" i] a').text()),
    wrap($ => $('h1[class*="title" i]').text()),
    wrap($ => $filter($, $('title')))
  ]
})
```

So each property is simply an array of rule functions; metascraper iterates through them, takes the first result that resolves, puts it in the final object, and that object is what gets printed.
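That traversal can be sketched roughly like this. This is a simplified, hypothetical version of metascraper's `getData` step (the real one also handles async rules and HTML escaping), using plain functions instead of cheerio so the fallback behavior is easy to see:

```javascript
// Simplified sketch: for each property, try its rule functions in order
// and keep the first non-empty value.
const getData = ({ rules, htmlDom, url }) => {
  const data = {}
  for (const [key, ruleFns] of Object.entries(rules)) {
    for (const ruleFn of ruleFns) {
      const value = ruleFn({ htmlDom, url })
      if (value != null && value !== '') {
        data[key] = value
        break // first match wins; lower-priority rules are skipped
      }
    }
  }
  return data
}

// Example: the first title rule finds nothing (say, no og:title meta tag),
// so the fallback rule's value is used instead.
const metadata = getData({
  url: 'https://example.com',
  htmlDom: null, // stand-in; real rules receive a cheerio instance
  rules: {
    title: [
      () => undefined,        // e.g. missing og:title
      () => 'Fallback title'  // e.g. the <title> tag
    ]
  }
})
```

This is why rule order matters: earlier entries in the array act as higher-priority selectors.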

## Writing a ptt.cc rule in practice

The core HTML scanning is implemented with cheerio, so when building your own rules you write them following cheerio's jQuery-like syntax, along with its limitations.

Let's look at the code directly:

```javascript
/* eslint-disable prefer-const */
'use strict'

const wrap = rule => ({ htmlDom, url }) => {
  const value = rule(htmlDom, url)
  return value
}

module.exports = () => ({
  author: [
    wrap($ => $('.article-metaline:first-child .article-meta-value').text())
  ],
  kanban: [wrap($ => $('.article-metaline-right .article-meta-value').text())],
  date: [
    wrap($ => $('.article-metaline:nth-child(4) .article-meta-value').text())
  ],
  context: [
    wrap($ => {
      const text = $('#main-content')
        .clone()
        .children(':not(a)')
        .remove()
        .end()
        .text()
      let imageUrls = []
      let rowTexts = text.replace(/\n/g, '')
      rowTexts = rowTexts.replace(
        /(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)/g,
        (matched, index, original) => {
          imageUrls.push(matched)
          return '@@@$#'
        }
      )
      let pureTexts = rowTexts.split('@@@$#')
      function rebuild (imageUrls, pureTexts) {
        const length =
          imageUrls.length >= pureTexts.length
            ? imageUrls.length
            : pureTexts.length
        let result = []
        for (let i = 0; i < length; i++) {
          if (pureTexts[i]) {
            result.push({ type: 'TEXT', content: pureTexts[i] })
          }
          if (imageUrls[i]) {
            result.push({ type: 'IMAGE', content: imageUrls[i] })
          }
        }
        return result
      }
      return {
        text: text,
        sections: rebuild(imageUrls, pureTexts)
      }
    })
  ],
  comments: [
    wrap($ => {
      let content = []
      // The trailing space matters here
      function checkReaction (target) {
        if (target === '→ ') {
          return 'none'
        } else if (target === '推 ') {
          return 'like'
        } else if (target === '噓 ') {
          return 'disLike'
        } else {
          return 'none'
        }
      }
      function checkDate (date) {
        const dateArray = date.split(' ')
        return (
          dateArray[dateArray.length - 2] +
          ' ' +
          dateArray[dateArray.length - 1]
        )
      }
      $('.push').each(function () {
        content.push({
          reaction: checkReaction(
            $(this)
              .children('.push-tag')
              .text()
          ),
          user: $(this)
            .children('.push-userid')
            .text(),
          content: $(this)
            .children('.push-content')
            .text(),
          date: checkDate(
            $(this)
              .children('.push-ipdatetime')
              .text()
          )
        })
      })
      return content
    })
  ]
})
```

As you can see, I structure the whole ptt.cc article before outputting it; the result looks like this:

```json
{
  "author": "<String>",
  "kanban": "<String>",
  "date": "<ISO Date>",
  "context": {
    "text": "<String>",
    "sections": [{type, content}]
  },
  "comments": [{reaction, user, content, date}]
}
```

Then I store the results in MongoDB, and the PTT scraper is complete.
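The post doesn't show the storage step, but a hedged sketch with the official `mongodb` Node.js driver might look like the following. The connection string, database name (`ptt`), collection name (`articles`), and the extra `scrapedAt` field are all made-up examples, not anything metascraper produces:

```javascript
// Shape the scraped metadata into the document we want to store.
// `scrapedAt` is an extra bookkeeping field added here, not part of
// metascraper's output.
const buildDocument = metadata => ({
  ...metadata,
  scrapedAt: new Date()
})

// Hypothetical persistence helper using the official `mongodb` driver
// (npm install mongodb). Requiring lazily keeps the driver optional.
async function saveArticle (metadata) {
  const { MongoClient } = require('mongodb')
  const client = new MongoClient('mongodb://localhost:27017')
  try {
    await client.connect()
    const articles = client.db('ptt').collection('articles')
    // Upsert on the article URL so re-scraping the same post
    // updates the existing document instead of duplicating it.
    await articles.updateOne(
      { url: metadata.url },
      { $set: buildDocument(metadata) },
      { upsert: true }
    )
  } finally {
    await client.close()
  }
}
```

Upserting on the URL is one reasonable way to make repeated crawls idempotent.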

## Afterword

I came across this package through a small tool I wrote for work. It seemed pretty interesting, so I took it for a spin.