#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2
import time
import re
from bs4 import BeautifulSoup
class HtmlDownloader(object):
    header = {'Cookie': 'AD_RS_COOKIE=20083363',
              'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/58.0.3029.110 Safari/537.36'}
    def download(self, url, retries=3):
        """Fetch url and return the raw HTML, retrying a few times on failure."""
        if url is None:
            raise Exception('url is None')
        if retries <= 0:
            raise Exception('download failed repeatedly: ' + url)
        request = urllib2.Request(url, None, HtmlDownloader.header)
        try:
            resp = urllib2.urlopen(request)
            if resp.getcode() != 200:
                # Non-200 status: back off briefly, then retry.
                time.sleep(5)
                return self.download(url, retries - 1)
            return resp.read()
        except urllib2.URLError as e:
            print e
            time.sleep(5)
            return self.download(url, retries - 1)
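    # Note: urllib2 exists only on Python 2. A hedged Python 3 equivalent
    # of the same request (untested sketch, same header dict):
    #   import urllib.request
    #   request = urllib.request.Request(url, None, HtmlDownloader.header)
    #   body = urllib.request.urlopen(request).read()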
    def readhtml(self, filename):
        """Return the contents of a locally saved HTML file."""
        with open(filename) as file_object:
            return file_object.read()
class HtmlParser(object):
    def has_tag(self, tag):
        # Note: has_attr('span') tests for an attribute named 'span', not a
        # nested <span> tag; unused, kept for reference.
        return tag.has_attr('span')
    def region_parser(self, html_content):
        """Print administrative division codes and names from stats.gov.cn."""
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(class_="MsoNormal"):
            # Each paragraph holds "code name" separated by a space.
            parts = tag.get_text().split(" ")
            print parts[0].strip() + "-->" + parts[1].strip()
    def contry_parse(self, html_content):
        """Print ISO 3166-1 codes and names from a saved wikitable page."""
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(class_="wikitable sortable"):
            # Rows span five <td> cells: the id sits in the first cell
            # and the name in the fifth.
            for i, td in enumerate(tag.select('td')):
                if i % 5 == 0:
                    print "id-->" + td.get_text().strip()
                elif i % 5 == 4:
                    print "name-->" + td.get_text().strip()
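    # Equivalent row-wise sketch (same five-cells-per-row assumption):
    #   tds = tag.select('td')
    #   for start in range(0, len(tds), 5):
    #       row = tds[start:start + 5]
    #       print row[0].get_text().strip() + "-->" + row[4].get_text().strip()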
    def contry_ipaddrlink_parse(self, html_content):
        """Print each per-country IP-block link found on ipblock.chacuo.net."""
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(href=re.compile(u'http://ipblock.chacuo.net/view/.*')):
            print tag.get_text() + "-->" + tag.get('href')
    def ipaddress_parse(self, html_content):
        """Return the text of the first <pre> block (the IP list), or None."""
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        pre = soup.find('pre')
        return pre.get_text() if pre is not None else None
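    # The parsers below assume the <pre> payload is tab-separated with one
    # network per line (column meanings follow the print labels used below;
    # the values here are purely hypothetical):
    #   1.0.0.0\t255.255.252.0\t22\t1024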
    def ipaddress_parse_text(self, html_content):
        """For each country link, download and print its IP list."""
        # Relies on the module-level html_downloader created in __main__.
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(href=re.compile(u'http://ipblock.chacuo.net/view/.*')):
            # Rewrite the view URL into its plain-text download URL.
            content = html_downloader.download(re.sub(r'view/c_', "down/t_txt=c_", tag.get('href')))
            try:
                print tag.get_text()
                contentstr = self.ipaddress_parse(content)
                for ipdata in contentstr.split('\r\n'):
                    data = ipdata.split('\t')
                    if len(data) > 3:
                        print '--->ip:' + data[0] + '--->mask:' + data[1] + '-->mask/len:' + data[2] + '-->num:' + data[3]
            except Exception as e:
                print "no data"
    def s_ipaddress_parse_text(self, html_content):
        """For each list link on ips.chacuo.net, download and print its IP list."""
        # Relies on the module-level html_downloader created in __main__.
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(href=re.compile(u'http://ips.chacuo.net/view/.*')):
            # Rewrite the view URL into its plain-text download URL.
            content = html_downloader.download(re.sub(r'view/s_', "down/t_txt=p_", tag.get('href')))
            try:
                print tag.get_text()
                contentstr = self.ipaddress_parse(content)
                for ipdata in contentstr.split('\r\n'):
                    data = ipdata.split('\t')
                    if len(data) > 2:
                        print '--->ip:' + data[0] + '--->mask:' + data[1] + '-->num:' + data[2]
            except Exception as e:
                print "no data"
    def isp_ipaddress_parse_text(self, html_content):
        """For each ISP list link on ipcn.chacuo.net, download and print its IP list."""
        # Relies on the module-level html_downloader created in __main__.
        if html_content is None:
            raise Exception('html is None')
        soup = BeautifulSoup(html_content, 'html.parser')
        for tag in soup.find_all(href=re.compile(u'http://ipcn.chacuo.net/view/.*')):
            # Rewrite the view URL into its plain-text download URL.
            content = html_downloader.download(re.sub(r'view/i_', "down/t_txt=c_", tag.get('href')))
            try:
                print tag.get_text()
                contentstr = self.ipaddress_parse(content)
                for ipdata in contentstr.split('\r\n'):
                    data = ipdata.split('\t')
                    if len(data) > 2:
                        print '--->ip:' + data[0] + '--->mask:' + data[1] + '-->num:' + data[2]
            except Exception as e:
                print "no data"
if __name__ == '__main__':
html_downloader = HtmlDownloader()
    # region codes (stats.gov.cn)
# html_content = html_downloader.download('http://www.stats.gov.cn/tjsj/tjbz/xzqhdm/201703/t20170310_1471429.html')
# html_parser = HtmlParser()
# html_parser.region_parser(html_content)
    # country codes (ISO 3166-1)
# html_content = html_downloader.readhtml('ISO3166-1.html')
# html_parser = HtmlParser()
# html_parser.contry_parse(html_content)
    # country IP address link parse
# html_content = html_downloader.download('http://ipblock.chacuo.net')
# html_parser = HtmlParser()
# html_parser.contry_ipaddrlink_parse(html_content)
    # country IP address parse
# html_content = html_downloader.download('http://ipblock.chacuo.net/down/t_txt=c_AO')
# html_parser = HtmlParser()
# html_parser.ipaddress_parse(html_content)
    # country IP address parse to text
# html_content = html_downloader.download('http://ipblock.chacuo.net')
# html_parser = HtmlParser()
# html_parser.ipaddress_parse_text(html_content)
    # CN "s" IP address parse (ips.chacuo.net)
html_content = html_downloader.download('http://ips.chacuo.net/')
html_parser = HtmlParser()
html_parser.s_ipaddress_parse_text(html_content)
    # CN ISP IP address parse (ipcn.chacuo.net)
html_content = html_downloader.download('http://ipcn.chacuo.net/')
html_parser = HtmlParser()
html_parser.isp_ipaddress_parse_text(html_content)
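    # Hypothetical follow-up (not part of the original flow): persist a
    # downloaded list instead of printing it, reusing a URL shown above.
    #   content = html_downloader.download('http://ipblock.chacuo.net/down/t_txt=c_AO')
    #   with open('c_AO.txt', 'w') as f:
    #       f.write(content)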