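I. MySQL Configuration
1. Configuring my.ini
Find [mysqld] and add skip-grant-tables and character-set-server=utf8 below it; find [mysql] and [client] and add default-character-set=utf8 below each. Note that skip-grant-tables turns off MySQL's privilege checks, so it is only appropriate on a local test instance.
Restart the MySQL service.
(Note: if these settings are already present, there is no need to add them.)
2. Create the database (create the nutch database manually) and the table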
CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL,
  -- (lines missing in the source)
  `status` int(11) DEFAULT NULL,
  -- (lines missing in the source)
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  -- (line missing in the source)
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  -- (lines missing in the source)
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  -- (lines missing in the source)
  `batchId` varchar(500) DEFAULT NULL,
  -- (line missing in the source)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
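Note: the columns of this table follow Nutch's conf file "gora-sql-mapping". The database and table can also be generated automatically: once "gora-sql-mapping", "gora.properties" and the other files are configured, the first run of "bin/nutch inject urls" creates them. Automatic generation may fail, but that is not a serious obstacle: checking hadoop.log right away will show where the problem lies.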
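Since hadoop.log is where such failures show up, a small helper that pulls out the most recent error lines can save some scrolling. The sketch below is only an illustration: the function name `recent_errors` is invented for this example, and it simply filters on the words ERROR and Exception rather than understanding the log format.

```shell
# Hypothetical helper (not part of Nutch): print the last N lines of a log
# file that mention ERROR or Exception, each prefixed with its line number.
recent_errors() {
    log_file=$1
    max_lines=${2:-20}   # default: the 20 most recent matches
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    grep -n -E 'ERROR|Exception' "$log_file" | tail -n "$max_lines"
}
```

Assuming the log sits at logs/hadoop.log under runtime/local (the location that appears in the log4j error message later in this article), `recent_errors logs/hadoop.log 10` would show the ten latest matches.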
II. Installing, Configuring, and Using Nutch
1. Download Nutch 2.2.x and extract it to a local installation directory; the extracted root directory is referred to below as ${NUTCH_HOME}.
2. To enable MySQL support in Nutch, edit ${NUTCH_HOME}/ivy/ivy.xml as follows:
1) Find the following line and uncomment it
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
2) Change the following line
The default is
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
After the change it reads
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
3) Uncomment the following line
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Note: if steps 2) and 3) above are skipped, the following exception is thrown:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
3. Database connection configuration
Edit ${NUTCH_HOME}/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
################################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# replace xxx with your MySQL user name
gora.sqlstore.jdbc.user=xxx
# replace xxx with your MySQL password
gora.sqlstore.jdbc.password=xxx
Fill in the address of the database you want to connect to, together with its user name and password.
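Before building, it can help to confirm which connection settings are actually active in gora.properties, since the file also holds the commented-out defaults. The following is a minimal sketch, not anything shipped with Nutch or Gora: `get_prop` is a made-up name, and it assumes the simple `key=value` form used above, with no spaces around the equals sign.

```shell
# Hypothetical helper: print the value of one key from a .properties file.
# Commented-out lines are ignored because their first field starts with '#'.
# If the key appears more than once, the last occurrence wins. The value may
# itself contain '=' (as a JDBC URL with query parameters does).
get_prop() {
    props_file=$1
    key=$2
    awk -F= -v k="$key" '
        $1 == k { sub(/^[^=]*=/, ""); value = $0; found = 1 }
        END     { if (found) print value; else exit 1 }
    ' "$props_file"
}
```

For example, `get_prop conf/gora.properties gora.sqlstore.jdbc.url` prints the JDBC URL that will be used, and the function returns a non-zero status when the key is absent.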
4. Edit the nutch-site configuration file
Add the following inside the configuration element of ${NUTCH_HOME}/conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>LiuXun Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <!-- the value line is missing in the source -->
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ….
  </description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
In particular, be sure to add the following:
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
Otherwise, when pages are crawled with "bin/nutch crawl urls -threads n -depth n", the following error appears in the log:
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
In addition, the "nutch-site" file must be saved in UTF-8 encoding; otherwise running nutch commands produces the following error:
Exception in thread "main" java.lang.RuntimeException: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
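Both pitfalls in this step (a missing generate.batch.id property and a file that is not UTF-8) can be caught before running any nutch command. The sketch below is one possible pre-flight check under a few assumptions: the function names are invented here, the iconv utility is assumed to be installed, and the property check is a plain text match rather than a real XML parse.

```shell
# Hypothetical pre-flight checks for conf/nutch-site.xml (not part of Nutch).

# Succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Succeeds only if the file contains <name>PROPERTY</name>. Because this is
# a text match, a property inside an XML comment also counts as present.
has_property() {
    grep -q -F "<name>$2</name>" "$1"
}

# Run both checks on one file; returns non-zero if either fails.
check_nutch_site() {
    site_file=$1
    status=0
    is_utf8 "$site_file" ||
        { echo "not valid UTF-8: $site_file" >&2; status=1; }
    has_property "$site_file" generate.batch.id ||
        { echo "generate.batch.id is missing from $site_file" >&2; status=1; }
    return "$status"
}
```

A call such as `check_nutch_site "${NUTCH_HOME}/conf/nutch-site.xml"` prints nothing and returns 0 when both conditions hold.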
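5. Build Nutch 2.2.*
1) First install Ant
2) Go to the ${NUTCH_HOME} directory and run the ant command
3) After a successful build, a runtime directory appears under ${NUTCH_HOME}
Note: at the step "[ivy:resolve] :: loading settings :: file = /home/appmon/release-2.2.1/ivy/ivysettings.xml" the build spends some time checking over the network and then continues, usually after about 2-5 minutes. If nothing happens for much longer than that, press Ctrl+C to stop it and run ant again.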
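After the build, a quick look at the runtime directory can confirm that the ivy.xml changes took effect, which is what the ClassNotFoundException mentioned earlier depends on. This is a rough sketch with assumptions stated up front: `check_runtime` is a made-up name, and it assumes the build places the launcher at runtime/local/bin/nutch and the dependency jars under runtime/local/lib, named after the ivy.xml artifacts.

```shell
# Hypothetical post-build check (not part of Nutch or Ant).
# Usage: check_runtime /path/to/nutch-home
check_runtime() {
    local_dir=$1/runtime/local
    status=0
    if [ ! -f "$local_dir/bin/nutch" ]; then
        echo "missing: $local_dir/bin/nutch (did the ant build finish?)" >&2
        status=1
    fi
    # Jar name prefixes assumed from the ivy.xml dependencies edited earlier.
    for prefix in gora-core gora-sql mysql-connector-java; do
        # ls fails when the glob matches no file.
        if ! ls "$local_dir"/lib/"$prefix"-*.jar > /dev/null 2>&1; then
            echo "no $prefix jar in $local_dir/lib" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_runtime "${NUTCH_HOME}"` reports each missing piece on standard error and returns non-zero if anything is absent.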
6. Crawling pages and related setup
1) Go to the ${NUTCH_HOME}/runtime/local directory
2) Set the site to crawl
Run the command (the urls directory must already exist)
echo 'http://www.oschina.net/' > urls/seed.txt
3) Run the crawl
bin/nutch crawl urls -depth 3 -topN 5
(If the following error appears: log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /usr/local/apache-nutch-2.2.1/runtime/local/logs/hadoop.log (No such file or directory),
the cause is that log4j.properties has the same name as the one used by another application. Go to nutch/runtime/local/conf and give log4j.properties a different name:
sudo mv log4j.properties log.properties)
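The nutch commands were introduced in an earlier chapter.
Once the crawl has finished, the content fetched by the crawler can be viewed in MySQL (the screenshot from the original post is not reproduced here).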
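The seeding step and the final look at the database can be wrapped in two small helpers. This is an illustrative sketch only: both function names are invented, the selected columns are taken from the CREATE TABLE statement in section I, and the row limit is an arbitrary choice.

```shell
# Hypothetical helpers (not part of Nutch) for seeding and inspecting a crawl.

# Write one seed URL per line into <dir>/seed.txt, creating the directory
# first so the redirect cannot fail on a missing path. An existing
# seed.txt is overwritten.
write_seeds() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_seeds <dir> <url>..." >&2
        return 1
    fi
    seed_dir=$1
    shift
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Print a query that lists a few columns of the webpage table.
# The optional argument is the row limit (default 10).
webpage_query() {
    limit=${1:-10}
    case $limit in
        *[!0-9]*) echo "limit must be a non-negative integer" >&2; return 1 ;;
    esac
    printf 'SELECT id, status, title, fetchTime FROM webpage LIMIT %s;\n' "$limit"
}
```

From runtime/local, `write_seeds urls 'http://www.oschina.net/'` prepares the seed file, and after the crawl something like `mysql -u xxx -p nutch -e "$(webpage_query 5)"` would list the first five rows, with xxx standing for the MySQL user configured in gora.properties.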
Reposted from: http://qispi.baihongyu.com/