In Part 1 of this series, I covered the process of acquiring a list of valid plat IDs. The primary tool was Selenium running headless and executing the site's JavaScript (source available here). The next step is to use those IDs to actually collect the GIS data.
Pandas Dataframe
In this particular case, the structure of the data table is known in advance, so once the list of IDs was collected, the next task was figuring out how to mine that table for each parcel (the raw HTML of an example table is shown in the HTML Table section below).
Recall that in Part 1 we saved all of the discovered parcel IDs into CSV files grouped by Major ID, so the miner script opens one of those CSV files as a dataframe:
csv_file = 'Plat_ID' + str("{0:02}".format(parcel_major)) + '.csv'
df_in = pd.read_csv(csv_file)
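If you want to run the miner across every Major ID at once, a simple loop over the saved files works. This is a sketch, not part of the original script; the glob pattern assumes the Plat_IDxx.csv naming from Part 1.

# Sketch: iterate over every Plat_IDxx.csv file produced in Part 1.
import glob
import pandas as pd

for csv_file in sorted(glob.glob('Plat_ID*.csv')):
    df_in = pd.read_csv(csv_file)
    # ... run the miner loop below against df_in['Plat_ID'] ...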
An empty pandas dataframe is then initialized with the appropriate column names, since the table contents are known in advance:
df = pd.DataFrame(columns=['Property_ID',
                           'Taxpayer1',
                           'Taxpayer2',
                           'Taxpayer_Address',
                           'Property_Address',
                           'Use',
                           'Acres',
                           'Sale_Date',
                           'Sale_Price',
                           'PLSS',
                           'Title_Source',
                           'School_District',
                           'Watershed_District'])
BeautifulSoup
Looping over the IDs, the table can be searched for within the source code of each web page. As an aside, I made one alteration to the loop: the Selenium driver is closed and restarted after every 100 page loads, because Selenium seemed to get very slow after several thousand executions (I was never able to pin down the cause). Once the page source is pulled, it is passed into BeautifulSoup, a tool that provides a clean way to search for elements in HTML. In the code below, a table element with the class propDetails is searched for; if found, the entire table resides in the table variable to be parsed later.
scancnt = 0
for parcelid in df_in['Plat_ID']:
    # Restart the driver every 100 page loads; it slowed down noticeably otherwise
    if(scancnt < 100):
        scancnt = scancnt + 1
    else:
        scancnt = 0
        driver.close()
        driver = webdriver.Chrome(service=srvc, options=chrome_options)

    url = 'https://gis.co.carver.mn.us/publicparcel/?pin=' + \
        str("{0:09}".format(parcelid))
    driver.get(url)

    results = []
    content = driver.page_source
    soup = BeautifulSoup(content, features="html.parser")
    table = soup.find('table', attrs={'class': 'propDetails'})
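The loop above reuses the srvc and chrome_options objects without showing where they come from. Below is a minimal sketch of a headless Chrome setup that provides them; the specific option flags and the chromedriver path are my own assumptions, not necessarily what the original script uses.

# Sketch of the headless driver setup assumed by the loop above.
# The chromedriver path is a placeholder for your own install.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")      # run without opening a browser window
chrome_options.add_argument("--disable-gpu")

srvc = Service("/path/to/chromedriver")        # placeholder path
driver = webdriver.Chrome(service=srvc, options=chrome_options)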
HTML Table
The table which Selenium collects is shown below:
<table cellpadding="0" cellspacing="0" class="propDetails">
<tbody>
<tr>
<td class="lable">Taxpayer</td>
<td class="data">VICTORIA CITY </td>
</tr>
<tr>
<td class="lable">Taxpayer 2</td>
<td class="data">C/O CITY CLERK </td>
</tr>
<tr>
<td class="lable">Taxpayer Address</td>
<td class="data">PO BOX 36 </td>
</tr>
<tr>
<td class="lable"></td>
<td class="data">VICTORIA, MN 55386-</td>
</tr>
<tr>
<td class="lable">Property Address</td>
<td class="data">8270 KELZER POND DR </td>
</tr>
<tr>
<td class="lable"></td>
<td class="data">VICTORIA, MN 55386</td>
</tr>
<tr>
<td class="lable">City/Township</td>
<td class="data">VICTORIA</td>
</tr>
<tr>
<td class="lable">Use 1</td>
<td class="data">Muni Srvc Other </td>
</tr>
<tr>
<td class="lable">Mapped Acres</td>
<td class="data">24.01</td>
</tr>
<tr>
<td class="lable">PLSS</td>
<td class="data">TWP 116, RNG 024, SEC 14</td>
</tr>
<tr>
<td class="lable">Plat Name</td>
<td class="data">WATERMARK</td>
</tr>
<tr>
<td class="lable">Title Source</td>
<td class="data">Abstract</td>
</tr>
<tr>
<td class="lable">Tax Description<p><b> NOTE: For tax purposes only. Not to be used on legal documents</b>
</p>
</td>
<td class="data">OUTLOT A</td>
</tr>
<tr>
<td class="lable">School District</td>
<td class="data">0112 </td>
</tr>
<tr>
<td class="lable">Watershed District</td>
<td class="data">WS 062 MINNEHAHA CREEK </td>
</tr>
</tbody>
</table>
BeautifulSoup Output
With the table acquired, we can now use BeautifulSoup's find_all function to look for td elements with the classes data and lable (note that the site really does spell the class name "lable"). Both lists can then be parsed and brought into the dataframe as follows.
d = table.find_all('td', attrs={'class': 'data'})
l = table.find_all('td', attrs={'class': 'lable'})
if(d != [] and l != []):
...
The resulting output for both data and lable is as follows:
Data
[<td class="data">VICTORIA CITY </td>, <td class="data">C/O CITY CLERK </td>, <td class="data">PO BOX 36 </td>, <td
class="data">VICTORIA, MN 55386-</td>, <td class="data">8270 KELZER POND DR </td>, <td class="data">VICTORIA, MN
55386</td>, <td class="data">VICTORIA</td>, <td class="data">Muni Srvc Other </td>, <td class="data">24.01</td>, <td
class="data">TWP 116, RNG 024, SEC 14</td>, <td class="data">WATERMARK</td>, <td class="data">Abstract</td>, <td
class="data">OUTLOT A</td>, <td class="data">0112 </td>, <td class="data">WS 062 MINNEHAHA CREEK </td>]
Lable
[<td class="lable">Taxpayer</td>, <td class="lable">Taxpayer 2</td>, <td class="lable">Taxpayer Address</td>, <td
class="lable"></td>, <td class="lable">Property Address</td>, <td class="lable"></td>, <td class="lable">
City/Township</td>, <td class="lable">Use 1</td>, <td class="lable">Mapped Acres</td>, <td class="lable">PLSS</td>,
<td class="lable">Plat Name</td>, <td class="lable">Title Source</td>, <td class="lable">Tax Description<p><b> NOTE: For
tax purposes only. Not to be used on legal documents</b></p>
</td>, <td class="lable">School District</td>, <td class="lable">Watershed District</td>]
Each element in the lable list determines where the corresponding data value belongs in the dataframe, i.e. which column it should populate. An alternative would have been to rely on fixed indexes for each element of the table, but I noticed that the table layout changes depending on the type of property (residential, commercial, etc.), so matching on labels allows at least some data to be collected even when the format or order changes.
for idx, item in enumerate(l):
    stritem = item.text.strip()
    if(stritem == "Taxpayer"):
        tp1 = d[idx].text.strip()
    elif(stritem == "Taxpayer 2"):
        tp2 = d[idx].text.strip()
    ...
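One wrinkle visible in the HTML above is that the taxpayer and property addresses span two rows, with the second row carrying an empty lable cell. Below is a sketch of one way to fold those continuation rows back into a single field; the prevlabel variable and the string concatenation are my own illustration, not necessarily how the original script handles it.

# Sketch: fold rows with an empty lable cell into the previous field.
# "prevlabel", "tpaddr", and "paddr" are illustrative names.
prevlabel = ""
tpaddr = ""
paddr = ""
for idx, item in enumerate(l):
    stritem = item.text.strip()
    if stritem == "":
        # Blank label: this row continues the previous field (e.g. city/state line)
        if prevlabel == "Taxpayer Address":
            tpaddr = tpaddr + " " + d[idx].text.strip()
        elif prevlabel == "Property Address":
            paddr = paddr + " " + d[idx].text.strip()
    else:
        prevlabel = stritem
        if stritem == "Taxpayer Address":
            tpaddr = d[idx].text.strip()
        elif stritem == "Property Address":
            paddr = d[idx].text.strip()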
With each string pulled from the table, a row for this parcel can be appended to the dataframe.
df = df.append({'Property_ID': str(parcelid),
                'Taxpayer1': tp1,
                'Taxpayer2': tp2,
                'Taxpayer_Address': tpaddr,
                'Property_Address': paddr,
                'Use': use,
                'Acres': acres,
                'Sale_Date': saledate,
                'Sale_Price': saleprice,
                'PLSS': plss,
                'Title_Source': titlesrc,
                'School_District': school,
                'Watershed_District': watershed,
                }, ignore_index=True)
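A compatibility note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. If you are running a newer pandas, an equivalent approach is to build the row as a single-row dataframe and concatenate it, for example:

# Equivalent append for pandas >= 2.0, where DataFrame.append no longer exists.
row = {'Property_ID': str(parcelid),
       'Taxpayer1': tp1,
       # ... remaining columns as in the block above ...
       }
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)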
Final Dataframe
Property_ID | Taxpayer1 | Taxpayer2 | Taxpayer_Address | Property_Address | Use | Acres | Sale_Date | Sale_Price | PLSS | Title_Source | School_District | Watershed_District |
655450580 | VICTORIA CITY | C/O CITY CLERK | PO BOX 36 VICTORIA, MN 55386- | 8270 KELZER POND DR VICTORIA, MN 55386 | Muni Srvc Other | 24.01 | | | TWP 116, RNG 024, SEC 14 | Abstract | 112 | WS 062 MINNEHAHA CREEK |
Lastly, the dataframe is saved to a CSV file.
df.to_csv(filename, index=False, encoding='utf-8')
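Since Part 1 grouped the IDs into one CSV per Major ID, the output filename can follow the same convention. The Parcels_ prefix below is just an illustrative choice, not necessarily the name the original script uses.

# Illustrative filename convention, mirroring the per-Major-ID input files.
filename = 'Parcels_' + str("{0:02}".format(parcel_major)) + '.csv'
df.to_csv(filename, index=False, encoding='utf-8')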
Next Step: Mapping the Data
Although mining the data was interesting in its own right (I was amazed at how easily a large amount of data could be collected), the real fun comes next, when this data is presented in a meaningful way. In Part 3 of this series, I'll show how to use folium to plot the data on a geographical map.