## Preface

These days everything is about speed, and web scraping is no exception. That is how I came across the package metascraper, which ships with a set of built-in rules that automatically scan the HTML DOM and extract metadata from the text.
## Installation

As usual with npm, installation is simple:

```shell
$ npm install metascraper --save
```
## Basic usage

```javascript
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])
const got = require('got')

const targetUrl =
  'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const { body: html, url } = await got(targetUrl)
  const metadata = await metascraper({ html, url })
  console.log(metadata)
})()
```
### Output

```json
{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}
```
## How it works

Let's take a look at how it works under the hood.

### The core

The core is actually quite simple: it curries the rule functions (you can look up function currying for background), takes the rule set as input, does some error checking, and merges the rules.
```javascript
'use strict'

const { isUrl } = require('@metascraper/helpers')
const whoops = require('whoops')

const mergeRules = require('./merge-rules')
const loadRules = require('./load-rules')
const loadHTML = require('./load-html')
const getData = require('./get-data')

const MetascraperError = whoops('MetascraperError')

module.exports = rules => {
  const loadedRules = loadRules(rules)
  return async ({ url, html, rules: inlineRules, escape = true } = {}) => {
    if (!isUrl(url)) {
      throw new MetascraperError({
        message: 'Need to provide a valid URL.',
        code: 'INVALID_URL'
      })
    }
    return getData({
      url,
      escape,
      htmlDom: loadHTML(html),
      rules: mergeRules(inlineRules, loadedRules)
    })
  }
}
```
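The actual `merge-rules` helper isn't shown here, but conceptually it combines the rules passed inline at call time with the rules loaded up front. A hypothetical sketch of that merging (an illustration only, not the real metascraper implementation) might look like this: per-property rule arrays are concatenated, with inline rules tried first.

```javascript
// Hypothetical sketch of merging inline rules with preloaded rules.
// Each rule set is an object mapping a property name (title, author, ...)
// to an array of rule functions; inline rules take priority.
const mergeRules = (inlineRules = [], loadedRules = []) => {
  const merged = {}
  for (const ruleSet of [...inlineRules, ...loadedRules]) {
    for (const [prop, fns] of Object.entries(ruleSet)) {
      merged[prop] = (merged[prop] || []).concat(fns)
    }
  }
  return merged
}

// Example: an inline title rule is placed ahead of the default one.
const defaults = [{ title: ['defaultTitleRule'] }]
const inline = [{ title: ['inlineTitleRule'], author: ['authorRule'] }]
console.log(mergeRules(inline, defaults))
// → { title: ['inlineTitleRule', 'defaultTitleRule'], author: ['authorRule'] }
```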
### Default rules

```javascript
require('metascraper-author')(),
require('metascraper-date')(),
require('metascraper-description')(),
require('metascraper-image')(),
require('metascraper-logo')(),
require('metascraper-clearbit')(),
require('metascraper-publisher')(),
require('metascraper-title')(),
require('metascraper-url')()
```
Each of the rules above is itself a curried function. For example, metascraper-title looks like this:
```javascript
'use strict'

const { $filter, title } = require('@metascraper/helpers')

const wrap = rule => ({ htmlDom }) => {
  const value = rule(htmlDom)
  return title(value)
}

module.exports = () => ({
  title: [
    wrap($ => $('meta[property="og:title"]').attr('content')),
    wrap($ => $('meta[name="twitter:title"]').attr('content')),
    wrap($ => $('.post-title').text()),
    wrap($ => $('.entry-title').text()),
    wrap($ => $('h1[class*="title" i] a').text()),
    wrap($ => $('h1[class*="title" i]').text()),
    wrap($ => $filter($, $('title')))
  ]
})
```
In short, each property's rules become an array of functions; metascraper iterates over them, evaluates the results, collects them into the final object, and that object is what gets printed.
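That lookup strategy can be sketched as a small standalone function (my assumption about how `getData` resolves each property, not the library's actual code): walk the rule array in order and return the first non-empty value.

```javascript
// Minimal sketch: try each rule function for a property in order,
// returning the first non-empty value (the "fallback chain" behaviour
// visible in the title rule above).
const resolveRule = async (rules, context) => {
  for (const rule of rules) {
    const value = await rule(context)
    if (value != null && value !== '') return value
  }
  return null
}

// Example with a fake DOM object: the og:title lookup yields nothing,
// so the <title> fallback wins.
const titleRules = [
  ({ htmlDom }) => htmlDom['og:title'], // undefined here
  ({ htmlDom }) => htmlDom['title']
]

resolveRule(titleRules, { htmlDom: { title: 'Hello PTT' } }).then(value =>
  console.log(value)
) // 'Hello PTT'
```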
## Writing a ptt.cc rule in practice

A custom rule simply plugs into the core's HTML scanning, which is implemented with cheerio. So you write your own rules using cheerio's jQuery-like selector syntax, and you inherit its limitations. Let's go straight to the code.
```javascript
'use strict'

const wrap = rule => ({ htmlDom, url }) => {
  const value = rule(htmlDom, url)
  return value
}

module.exports = () => ({
  author: [
    wrap($ => $('.article-metaline:first-child .article-meta-value').text())
  ],
  kanban: [wrap($ => $('.article-metaline-right .article-meta-value').text())],
  date: [
    wrap($ => $('.article-metaline:nth-child(4) .article-meta-value').text())
  ],
  context: [
    wrap($ => {
      const text = $('#main-content')
        .clone()
        .children(':not(a)')
        .remove()
        .end()
        .text()
      let imageUrls = []
      let rowTexts = text.replace(/\n/g, '')
      // Swap every image URL for a sentinel token, remembering the URL.
      rowTexts = rowTexts.replace(
        /(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)/g,
        (matched, index, original) => {
          imageUrls.push(matched)
          return '@@@$#'
        }
      )
      let pureTexts = rowTexts.split('@@@$#')
      // Interleave text chunks and image URLs back into ordered sections.
      function rebuild(imageUrls, pureTexts) {
        const length =
          imageUrls.length >= pureTexts.length
            ? imageUrls.length
            : pureTexts.length
        let result = []
        for (let i = 0; i < length; i++) {
          if (pureTexts[i]) {
            result.push({ type: 'TEXT', content: pureTexts[i] })
          }
          if (imageUrls[i]) {
            result.push({ type: 'IMAGE', content: imageUrls[i] })
          }
        }
        return result
      }
      return { text: text, sections: rebuild(imageUrls, pureTexts) }
    })
  ],
  comments: [
    wrap($ => {
      let content = []
      // PTT marks comments with 推 (upvote), 噓 (downvote), or → (neutral).
      function checkReaction(target) {
        if (target === '→ ') {
          return 'none'
        } else if (target === '推 ') {
          return 'like'
        } else if (target === '噓 ') {
          return 'disLike'
        } else {
          return 'none'
        }
      }
      // Keep only the trailing date and time of the ip/datetime field.
      function checkDate(date) {
        const dateArray = date.split(' ')
        return (
          dateArray[dateArray.length - 2] +
          ' ' +
          dateArray[dateArray.length - 1]
        )
      }
      $('.push').each(function () {
        content.push({
          reaction: checkReaction($(this).children('.push-tag').text()),
          user: $(this).children('.push-userid').text(),
          content: $(this).children('.push-content').text(),
          date: checkDate($(this).children('.push-ipdatetime').text())
        })
      })
      return content
    })
  ]
})
```
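The interesting part of the `context` rule is the placeholder trick: image URLs are swapped for a sentinel string, the text is split on that sentinel, and the text and image pieces are interleaved back into ordered sections. Here is that technique extracted into a standalone function (using the same regex and sentinel as above, minus the newline stripping) so it can be tried without cheerio:

```javascript
// Standalone illustration of the sentinel-split technique from the
// context rule: pull image URLs out of a text blob while preserving
// their position relative to the surrounding text.
const splitSections = text => {
  const imageUrls = []
  const stripped = text.replace(
    /(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)/g,
    matched => {
      imageUrls.push(matched) // remember the URL...
      return '@@@$#' // ...and leave a sentinel in its place
    }
  )
  const pureTexts = stripped.split('@@@$#')
  const result = []
  const length = Math.max(imageUrls.length, pureTexts.length)
  for (let i = 0; i < length; i++) {
    if (pureTexts[i]) result.push({ type: 'TEXT', content: pureTexts[i] })
    if (imageUrls[i]) result.push({ type: 'IMAGE', content: imageUrls[i] })
  }
  return result
}

console.log(splitSections('before https://i.imgur.com/abc.jpg after'))
// [ { type: 'TEXT', content: 'before ' },
//   { type: 'IMAGE', content: 'https://i.imgur.com/abc.jpg' },
//   { type: 'TEXT', content: ' after' } ]
```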
As you can see, this turns a whole ptt.cc article into structured output. The result looks like this:

```json
{
  "author": "<String>",
  "kanban": "<String>",
  "date": "<ISO Date>",
  "context": {
    "text": "<String>",
    "sections": [{ "type": "<TEXT | IMAGE>", "content": "<String>" }]
  },
  "comments": [{ "reaction": "<String>", "user": "<String>", "content": "<String>", "date": "<String>" }]
}
```
Store the result in MongoDB, and the PTT scraper is done.
## Afterword

I came across this package while building a small tool for work. It turned out to be quite fun, so I took it for a spin here.