Using Python to Automatically Scrape a Web Novel and Generate a TXT File - 缙哥哥


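I looked into this purely because a reader asked when the proofread TXT edition of 《剑来》 would be updated. I had previously tried the TextForever tool to batch-convert HTML web novels to TXT in one click, but web novel pages now embed ads inside the chapter text, and some even split chapters across pages automatically. Cleaning up the layout afterwards is tedious and sluggish, at least on 缙哥哥's current laptop, which is rather slow. Python is popular these days, so it seemed a good fit for scraping a web novel automatically and generating a TXT file.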

Installing the Python Environment

1. Download the Python 3.7.3 source package

wget -c https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz

2. Extract the Python 3.7.3 source package

tar -xzvf Python-3.7.3.tgz

3. Enter the Python-3.7.3 directory

cd Python-3.7.3

4. Configure the build

./configure --with-ssl

"Unlucky" readers may run into the following error:

configure: error: in `/root/Python-3.7.3':
configure: error: no acceptable C compiler found in $PATH

To fix this error, install the GCC compiler suite:

yum install gcc

Then re-run the configure command from step 4.

5. Install openssl-devel support

yum install openssl-devel

6. Compile and install

make && make install

At this point you'll be "lucky" enough to discover that it has, damn it, failed again:

File "/root/Python-3.7.3/Lib/ctypes/__init__.py", line 7, in <module>
from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
make: *** [install] Error 1

The cause is a missing dependency. Installing the libffi development package fixes it:

yum install libffi-devel -y

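Once that is resolved, repeat the compile-and-install in step 6. Barring surprises, you'll see the Successfully installed message confirming that the installation worked!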
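The openssl-devel and libffi-devel packages matter because the ssl and _ctypes modules are only built when those headers are present. A minimal sketch for checking the finished interpreter, assuming it is run with the newly installed python3; the helper name missing_modules is made up for this example:

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # ssl needs openssl-devel at build time; _ctypes needs libffi-devel
    missing = missing_modules(["ssl", "_ctypes"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("ssl and _ctypes are available")
```

If either module is reported missing, installing the matching -devel package and rebuilding is the likely fix.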

Running the Python Script to Scrape a Novel

The script relies on third-party Python libraries, so install BeautifulSoup and requests first.

pip3 install beautifulsoup4
pip3 install requests

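Then create a folder anywhere (or just use the root directory) and put the 17549.py script in it.

Run python3 17549.py (if you created a folder, remember to cd into it first) and the scraping begins...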

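The Python Web Novel Scraping Script

For convenience, the script has been packaged and uploaded to a network drive, so you can download it and use it directly.

Download the sample Python script that scrapes a novel into a TXT file: http://ct.dujin.org/f/5210373-485800393-60f362

The source code is as follows: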

# -*- coding:UTF-8 -*-
import sys
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class downloader(object):

    def __init__(self, url):
        self.target = url  # URL of the novel's table-of-contents page
        self.names = []    # chapter titles
        self.urls = []     # chapter links
        self.nums = 0      # number of chapters
        self.title = ""    # novel title

    def get_one_text(self, url_i):
        # Chapter links are site-relative, so resolve them against the index URL
        url_i = urljoin(self.target, url_i)
        r = requests.get(url=url_i)
        r.encoding = r.apparent_encoding

        html_bf = BeautifulSoup(r.text, features='html.parser')
        # Drop the first and last <p> (presumably not chapter text on this site)
        texts = html_bf.find_all('p')[1:-1]
        # One paragraph per line; a <br/> inside a paragraph also becomes a newline
        return '\n'.join(t.get_text('\n') for t in texts)

    def get_name_address_list(self):
        r = requests.get(self.target)
        r.encoding = r.apparent_encoding
        div_bf = BeautifulSoup(r.text, features='html.parser')
        self.title = div_bf.find('h1').text
        div = div_bf.find_all('div', attrs={"id": "play_0"})[0]
        for li in div.find_all('li'):
            a = li.find('a')
            self.names.append(a.get_text())  # chapter title
            self.urls.append(a.get('href'))  # relative chapter URL
        self.nums = len(self.urls)
        print("Total: " + str(self.nums) + " chapters")

    def writer(self, name, path, text):
        with open(path, 'a', encoding='utf-8') as f:  # append to the output file
            f.write(name + '\n')
            f.write(text)
            f.write('\n\n')


if __name__ == "__main__":

    dl = downloader("https://www.nitianxieshen.com/zhuxian/")
    dl.get_name_address_list()

    print('Downloading "' + dl.title + '":')
    for i in range(dl.nums):
        time.sleep(0.2)  # short pause between requests
        try:
            dl.writer(dl.names[i], dl.title + '.txt', dl.get_one_text(dl.urls[i]))
        except requests.RequestException as e:
            print(repr(e))
        sys.stdout.write('\rDownloaded: %.3f%%  chapter %d' % ((i + 1) / dl.nums * 100, i + 1))
        sys.stdout.flush()
    print('\n' + dl.title + ' download complete')
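As noted at the start, web novel pages often embed ads in the chapter text, so what get_one_text returns may still need cleaning. The following is only a sketch of one possible post-filter: the AD_PATTERNS list and the clean_chapter_text name are made up for illustration, and a real site will need its own patterns. It drops any whole line that looks like an ad and collapses runs of blank lines.

```python
import re

# Hypothetical ad markers, for illustration only; real sites need their own patterns
AD_PATTERNS = [
    re.compile(r"https?://\S+"),
    re.compile(r"www\.\S+\.(?:com|net|org)", re.IGNORECASE),
]


def clean_chapter_text(text):
    """Drop ad-like lines and collapse runs of blank lines into one."""
    cleaned = []
    previous_blank = True  # starting as True also trims leading blank lines
    for line in text.splitlines():
        line = line.strip()
        if any(p.search(line) for p in AD_PATTERNS):
            continue  # the whole line is discarded, not just the match
        if not line:
            if previous_blank:
                continue
            previous_blank = True
        else:
            previous_blank = False
        cleaned.append(line)
    # Remove a trailing blank line, if any
    while cleaned and not cleaned[-1]:
        cleaned.pop()
    return "\n".join(cleaned)
```

One way to use it would be to wrap the chapter text before it is passed to writer, for example clean_chapter_text(dl.get_one_text(url)).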

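To save a different web novel as a single TXT file, just change the table-of-contents URL passed to downloader(...) near the end of the script. This works for other novels on the same site; a different site will generally also need selectors that match its own markup.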
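When adapting the script to another site, it helps to test the chapter-list parsing offline first. Below is a rough sketch under stated assumptions: parse_index and SAMPLE_HTML are invented for this example, the markup is made up to resemble what the script above expects (a div with id play_0 containing chapter links), and it collects the a tags directly rather than going through each li as the script does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_index(html, base_url, container_id="play_0"):
    """Return (title, absolute_url) pairs for the chapter links in one container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        return []  # the selector does not match this site's markup
    chapters = []
    for a in container.find_all("a", href=True):
        # urljoin handles both site-relative and page-relative hrefs
        chapters.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return chapters


# Made-up markup, shaped like what the script above expects
SAMPLE_HTML = """
<h1>Sample Novel</h1>
<div id="play_0">
  <ul>
    <li><a href="/sample/1.html">Chapter 1</a></li>
    <li><a href="2.html">Chapter 2</a></li>
  </ul>
</div>
"""

if __name__ == "__main__":
    for title, url in parse_index(SAMPLE_HTML, "https://example.com/sample/"):
        print(title, url)
```

If parse_index returns an empty list for a saved copy of the new site's index page, the container id is probably different there and needs adjusting before the downloader will find any chapters.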
