网页抓取超链接页面

M Faizan Farooq

我想从本地网站抓取数据。该代码收集该页面中的表,但我还想收集从表“VO/NGO 名称”字段超链接的数据。

在此处输入图片说明

这是主表。我想要的其他字段来自单击“VO/NGO 名称”时出现的页面。

在此处输入图片说明

我阅读了在线资料,但无法更正代码。

第一个输出应该是这样的,所以每个非政府组织的列表应该是这样的:

在此处输入图片说明

Option Explicit
Public Sub GetInfo()
    Const URL As String = "https://ngodarpan.gov.in/index.php/home/statewise_ngo/76/35/1"
    Dim html As HTMLDocument, hTable As HTMLTable, ws As Worksheet, headers()
    headers = Array("Sr No.", "Name of VO/NGO", "Address", "City","State","Telephone","Mobile No.","Website","Email")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        html.body.innerHTML = .responseText
    End With
    Set hTable = html.querySelector("table.dvdtbl")
    Dim td As Object, tr As Object, r As Long, c As Long
    r = 1
    With ws
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        For Each tr In hTable.getElementsByTagName("tr")
            r = r + 1: c = 1
            If r > 3 Then
                For Each td In tr.getElementsByTagName("td")
                    .Cells(r - 2, c) = IIf(c = 2, "'" & td.innerText, td.innerText)
                    c = c + 1
                Next
            End If
        Next
    End With
End Sub
SIM卡

要实现您所追求的结果,需要做几件事。

  1. 您需要解析每个链接的 id 号以在发布请求中重用它。
  2. 您需要从此链接解析 csrf 令牌以用于发布请求

最后,您必须使用任何 json 转换器或脚本控件从该 json 响应中挖掘出各个字段。

我的以下尝试可以为您获取 json 响应。您现在需要做的就是解析 json 以满足您的要求:

Sub FetchTabularInfo()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim col As Variant, icol As New Collection
    Dim csrf As Variant, I&

    With Http
        .Open "GET", "https://ngodarpan.gov.in/index.php/home/statewise_ngo/76/35/1", False
        .send
        Html.body.innerHTML = .responseText
    End With

    With Html.querySelectorAll(".table tr a[onclick^='show_ngo_info']")
        For I = 0 To .Length - 1
            icol.Add Split(Split(.item(I).getAttribute("onclick"), "(""")(1), """)")(0)
        Next I
    End With

    For Each col In icol
        With Http
            .Open "GET", "https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf", False
            .send
            csrf = .responseText
        End With

        csrf = Split(Replace(Split(csrf, ":")(1), """", ""), "}")(0)

        With Http
            .Open "POST", "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", False
            .setRequestHeader "X-Requested-With", "XMLHttpRequest"
            .setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
            .send "id=" & col & "&csrf_test_name=" & csrf
        End With

        Debug.Print Http.responseText
    Next col
End Sub

第一个引线的输出:

{"status":1,"infor":{"0":{"UniqueID":"AN\/2017\/0161456","Mobile":"9476076176","Email":"[email protected]","ngo_url":"http:\/\/www.adityanatyaacademy.com","ngo_name":"AdityaNatyaAcademy","pan_updDocId":"220156","reg_updDocId":"221361","Off_phone1":null,"Major_Activities1":".Drama\nJatrapala\nStreetplays\nAwareness Programe"},"issues_working_db":"","operational_states_db":"ANDAMAN & NICOBAR ISLANDS, ","operational_district_db":"ANDAMAN & NICOBAR ISLANDS->South Andaman, "},"member_info":[{"SalCode":null,"FName":"ASHUTOSH KARMAKAR","MName":null,"LName":null,"DesigName":"President","EmailId":"[email protected]","MobileNo":"9434262953","pan_updDocId":"223392","aadhaar_updDocId":"223393"},{"SalCode":null,"FName":"KAVERI DEBSHARMA","MName":null,"LName":null,"DesigName":"Member","EmailId":"[email protected]","MobileNo":"9474299901","pan_updDocId":"223400","aadhaar_updDocId":"223401"},{"SalCode":null,"FName":"SATYAJIT BAIN","MName":null,"LName":null,"DesigName":"Asisstant Secretary","EmailId"
:"[email protected]","MobileNo":"9434271746","pan_updDocId":"223408","aadhaar_updDocId":"223409"}],"registeration_info":[{"nr_orgName":"AdityaNatyaAcademy","nr_add":"31 M.G. Road,\nOpp. Sun Sea Resort,\nMiddle Point.","nr_city":"Port Blair","StateName":"ANDAMAN & NICOBAR ISLANDS","reg_name":"Registrar of Companies","TypeDescription":"Registered Societies (Non-Government)","nr_regNo":"888","nr_updDocId":"0","nr_actName":"Society Registration Act 1860","nr_isFcra":"N","fcrano":"","ngo_reg_date":"05-12-1995"}],"source_info":[{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To Promote Art and Culture in Andaman and Nicobar Islands.","datefrom":"2013-04-01","dateto":"2014-03-31","amount_sanctioned":"25000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Culture","datefrom":"2014-04-01","dateto":"2015-03-31","amount_sanctioned":"25000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Cult
ure","datefrom":"2015-04-01","dateto":"2016-03-31","amount_sanctioned":"35000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Culture","datefrom":"2016-04-01","dateto":"2017-03-31","amount_sanctioned":"25000"}]}

添加执行上述脚本的参考:

Microsoft Html Object Library
Microsoft xml, v6.0

本文收集自互联网,转载请注明来源。

如有侵权,请联系[email protected] 删除。

编辑于
0

我来说两句

0条评论
登录后参与评论

相关文章

来自分类Dev

从具有相同链接的页面进行网页抓取

来自分类Dev

网页抓取流星页面

来自分类Dev

网页抓取流星页面

来自分类Dev

从iframe页面抓取网页

来自分类Dev

使用 Rvest 抓取超链接

来自分类Dev

如何使用Java在网页上单击超链接而不在浏览器中打开页面

来自分类Dev

通过链接文本抓取网页

来自分类Dev

抓取,抓取链接,然后抓取页面

来自分类Dev

Python + scrapy + 网页抓取:页面未被抓取

来自分类Dev

如何使用PHP从动态网页中抓取页面链接?

来自分类Dev

Python + 网页抓取 + scrapy:如何从 IMDb 页面获取所有电影的链接?

来自分类Dev

使用BeautifulSoup从网页中抓取特定链接

来自分类Dev

从网页中抓取选择性链接

来自分类Dev

从网站抓取网页以捕获分页链接的问题

来自分类Dev

R - 帮我从网页上抓取链接

来自分类Dev

网页抓取 pdf 链接 - 不返回结果

来自分类Dev

为网页抓取器创建简化的页面

来自分类Dev

具有多个部分的网页抓取页面

来自分类Dev

使用python为多个页面抓取网页

来自分类Dev

网页抓取 - 使用 R 的多个页面

来自分类Dev

如何与子页面并行抓取网页?

来自分类Dev

网页抓取 - 如何获取网页链接的特定部分

来自分类Dev

网页抓取:抓取页面并在 DataFrame 中存储内容

来自分类Dev

使用页面锚点到超链接的RequestNavigate

来自分类Dev

在移动设备上超链接页面

来自分类Dev

使用BeautifulSoup + Python在href内抓取超链接

来自分类Dev

使用工作超链接将网页“打印”为pdf

来自分类Dev

如何导出网页上的所有超链接?

来自分类Dev

弹出网页表单没有按钮和超链接