Menu

Virtual Geek

Tales from real IT system administrators world and non-production environment

Powershell web scrapping extract table from html

This is small script I use to collect the table data from website to Powershell Object and then further it can be saved to csv file. For this example I am taking table example from https://datatables.net. I will try to convert table to Powershell Object. HTML tables has tags th (Table header) and td (Table data), all the data is between these HTMLs tags, to extract the data, I will just remove the tags and convert them to object using add-member. In the script I have made use of PowerShell slice array into groups of smaller arrays. Break the table rows using table header count. This scripts is running good for single html table.

Microsoft Powershell Invoke-webrequest all elements where-object new-object system collection arraylist for loop psobject html table add-member.png

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#Open website
$webSite = Invoke-WebRequest -Uri https://datatables.net/

#Find table in the website
$tableHeader = $webSite.AllElements | Where-Object {$_.tagname -eq 'th'}
$tableData = $webSite.AllElements | Where-Object {$_.tagname -eq 'td'}

#Table header and data
$thead = $tableHeader.innerText[0..(($tableHeader.innerText.count/2) - 1)]
$tdata = $tableData.innerText

#Break table data into smaller chuck of data.
$dataResult = New-Object System.Collections.ArrayList
for ($i = 0; $i -le $tdata.count; $i+= ($header.count - 1))
{
    if ($tdata.count -eq $i)
    {
        break
    }        
    $group = $i + ($header.count - 1)
    [void]$dataResult.Add($tdata[$i..$group])
    $i++
}

#Html data into powershell table format
$finalResult = @()
foreach ($data in $dataResult)
{
    $newObject = New-Object psobject
    for ($i = 0; $i -le ($thead.count - 1); $i++) {
        $newObject | Add-Member -Name $thead[$i] -MemberType NoteProperty -value $data[$i]
    }
    $finalResult += $newObject
}
$finalResult | ft -AutoSize

In this second example I am using same technique but copy pasting table from source of HTML, if there are multiple tables. I have copied the raw table and beatified it for better results.

Microsoft powershell html thead table tr td thead to pscustomobject psobject add-member foreach-object arraylist system.object array.png

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
$rawdata = @'
    </table>
        <thead>
            <tr>
                <th>First name</th>
                <th>Last name</th>
                <th>Position</th>
                <th>Office</th>
                <th>Age</th>
                <th>Start date</th>
                <th>Salary</th>
                <th>Extn.</th>
                <th>E-mail</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Tiger</td>
                <td>Nixon</td>
                <td>System Architect</td>
                <td>Edinburgh</td>
                <td>61</td>
                <td>2011/04/25</td>
                <td>$320,800</td>
                <td>5421</td>
                <td>t.nixon@datatables.net</td>
            </tr>
            <tr>
                <td>Garrett</td>
                <td>Winters</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>63</td>
                <td>2011/07/25</td>
                <td>$170,750</td>
                <td>8422</td>
                <td>g.winters@datatables.net</td>
            </tr>
            <tr>
                <td>Ashton</td>
                <td>Cox</td>
                <td>Junior Technical Author</td>
                <td>San Francisco</td>
                <td>66</td>
                <td>2009/01/12</td>
                <td>$86,000</td>
                <td>1562</td>
                <td>a.cox@datatables.net</td>
            </tr>
            <tr>
                <td>Cedric</td>
                <td>Kelly</td>
                <td>Senior Javascript Developer</td>
                <td>Edinburgh</td>
                <td>22</td>
                <td>2012/03/29</td>
                <td>$433,060</td>
                <td>6224</td>
                <td>c.kelly@datatables.net</td>
            </tr>
            <tr>
                <td>Airi</td>
                <td>Satou</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>33</td>
                <td>2008/11/28</td>
                <td>$162,700</td>
                <td>5407</td>
                <td>a.satou@datatables.net</td>
            </tr>
        </tbody>
    </table>
'@

$th = $rawdata -split "`r`n"| where-object {$_ -match '<th>'}
$td = $rawdata -split "`r`n"| where-object {$_ -match '<td>'}

$thead = (($th -replace '<th>', '') -replace '</th>', '').trim()
$tdata = (($td -replace '<td>', '') -replace '</td>', '').trim()

$groupOf = 
$result = New-Object System.Collections.ArrayList
for ($i = 0; $i -le $tdata.count; $i+= ($thead.count - 1))
{
    if ($tdata.count -eq $i)
    {
        break
    }        
        
    $group9 = $i + ($thead.count - 1)
    #"{0} {1}" -f $i, $group9

    [void]$result.Add($tdata[$i..$group9])
    #$result = @($tdata[$i..$group9]) #-join ', '
    $i++
}

$finalResult = @()
foreach ($data in $result)
{
    $finalResult += [pscustomobject]@{
        $thead[0] = $data[0]
        $thead[1] = $data[1]
        $thead[2] = $data[2]
        $thead[3] = $data[3]
        $thead[4] = $data[4]
        $thead[5] = $data[5]
        $thead[6] = $data[6]
        $thead[7] = $data[7]
        $thead[8] = $data[8]
    }
}
$finalResult | ft -Auto

Download this script here, it is also available on github.com/kunaludapi.

Useful Articles
Powershell Convert range of numbers into another list of numbers maintaining ratio
Powershell Active Directory: List complete hierarchy of upstream nested groups recursively of User
Powershell Active Directory: Show treeview of User or Group memberof hierarchy
Powershell Active Directory: Show treeview of nested Group members downstream hierarchy
Oneliner Microsoft Powershell Script Get members from a list of group from Active Directory in excel
Powershell Active Directory 1: Check, enable and disable child OU protect object from accidental deletion
Oneliner Powershell how to add or remove AD user members and objects in groups

Go Back



Comment

Blog Search

Page Views

11272615

Follow me on Blogarama