ChatGPT for Data Wrangling Works

ChatGPT for Data Wrangling Works

How AI replaces majority of data scientists

·

6 min read

I find ChatGPT very good for data wrangling works that make up 80% of a data scientist's daily work.

hupili_robot\_reading\_a\_GPX\_file\_and\_write\_python\_script\_88b6c747-ffa8-4c80-ad68-13592cee1c98.png

Context

I have a script to make a race plan given a GPX file and list of check points (example output of Lantau 70). The GPX file can be found from race organiser website.

Due to technology lagging, many organizer's today don't have check points embedded in the GPX file. Towards this end, my initial design is to allow user to input a list of check points in form of, 1) cummulative distance from start; 2) Or latitude and longitude tuple.

I got request from an ultra runner today to make a race plan for TNF 100 Mo Gan Shan. The official GPX file has check points ("waypoints", aka wpt) embedded. GPX file is an XML file in plain text format, so we can extract the information with any text editor and input in my planner spreadsheet.

Work with ChatGPT

GPX file overview

Following is the overview of a GPX file in case the reader did not see one before:

  1. Part 1 is the list of way points / check points (usually designated by organiser from a map)
  2. Part 2 is the list of track points (usually recorded from a sports watch/ or designed from a mapping software)
<?xml version="1.0" encoding="utf-8"?>
<gpx
        version="1.0"
        xmlns="http://www.topografix.com/GPX/1/0"
        creator="ovitalmap - http://www.ovital.com"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd">

<time>2013-03-15T00:00:00Z</time>
<wpt lon="119.89916200" lat="30.60791600" >
        <ele>53</ele>
        <name>起终点 郡安里度假酒店</name>
</wpt>
<wpt lon="119.89111824" lat="30.53001315" >
        <ele>64</ele>
        <name>CP1 对河口水库</name>
</wpt>


... other way points omitted


<trk>
        <name>100km</name>
        <trkseg>
                <trkpt lon="119.89916200" lat="30.60791600" >
                        <ele>53</ele>
        <time>1970-01-01T07:59:59Z</time>
                </trkpt>
                <trkpt lon="119.89952100" lat="30.60759700" >
                        <ele>57</ele>
        <time>1970-01-01T07:59:59Z</time>
                </trkpt>
                <trkpt lon="119.89949000" lat="30.60755500" >
                        <ele>57</ele>

... other track point ommitted

                        <ele>53</ele>
        <time>1970-01-01T07:59:59Z</time>
                </trkpt>
                <trkpt lon="119.89919300" lat="30.60792400" >
                        <ele>53</ele>
        <time>1970-01-01T07:59:59Z</time>
                </trkpt>
                <trkpt lon="119.89919300" lat="30.60792900" >
                        <ele>53</ele>
        <time>1970-01-01T07:59:59Z</time>
                </trkpt>
        </trkseg>
</trk>
</gpx>

Instruction to ChatGPT

I give ChatGPT the following instructions:

read the following XML file, and convert to a tab separated table with name, lat, lon, ele

------

<wpt lon="119.89916200" lat="30.60791600" >
        <ele>53</ele>
        <name>起终点 郡安里度假酒店</name>
</wpt>
<wpt lon="119.89111824" lat="30.53001315" >
        <ele>64</ele>
        <name>CP1 对河口水库</name>
</wpt>
<wpt lon="119.87026124" lat="30.49539384" >
        <ele>120</ele>
        <name>CP2-西岑坞</name>
</wpt>
<wpt lon="119.80487134" lat="30.50543799" >
        <ele>533</ele>
        <name>CP3-王位山</name>
</wpt>
<wpt lon="119.78959181" lat="30.52616307" >
        <ele>196</ele>
        <name>CP4-下郎村</name>
</wpt>
<wpt lon="119.80046350" lat="30.53920958" >
        <ele>124</ele>
        <name>CP5-清逸山居</name>
</wpt>
<wpt lon="119.80986392" lat="30.53757413" >
        <name>CP6-东沈村</name>
</wpt>
<wpt lon="119.80834456" lat="30.60588694" >
        <ele>178</ele>
        <name>CP7-瑶坞</name>
</wpt>
<wpt lon="119.82839453" lat="30.62405749" >
        <ele>636</ele>
        <name>CP8-大洋里</name>
</wpt>
<wpt lon="119.86925290" lat="30.63074462" >
        <ele>132</ele>
        <name>CP9-大树下</name>
</wpt>

Response

namelatlonele
起终点 郡安里度假酒店30.60791600119.8991620053
CP1 对河口水库30.53001315119.8911182464
CP2-西岑坞30.49539384119.87026124120
CP3-王位山30.50543799119.80487134533
CP4-下郎村30.52616307119.78959181196
CP5-清逸山居30.53920958119.80046350124
CP6-东沈村30.53757413119.80986392N/A
CP7-瑶坞30.60588694119.80834456178
CP8-大洋里30.62405749119.82839453636
CP9-大树下30.63074462119.86925290132

To convert this XML file to a tab-separated table with columns "name," "lat," "lon," and "ele," we simply need to extract the corresponding attributes and values for each "wpt" element and populate the table accordingly. For each "wpt" element, the "name" attribute corresponds to the first column, the "lat" attribute corresponds to the second column, the "lon" attribute corresponds to the third column, and the "ele" element corresponds to the fourth column.

Note that there is one "wpt" element with a missing "ele" element, which we denote as "N/A."

Notes of response

  • ChatGPT masters markdown very well, so the output table is well formatted. The table can be rendered in the Chat UI, or be copied into Google Sheets directly.
  • It even proactively pointed out that one way point has no elevation!

Reflection

About data wrangling: (ref)

Data wrangling is the process of converting raw data into a usable form. It may also be called data munging or data remediation.

In contrast to the public's image of working on fancy models and visualizations everyday, many data scientists spend a majority of their time to transform data into usable format.

In our case today, I have two approaches in the past:

  1. Manually copy and paste from XML to Google Sheets -- If I deem this to be one off work, or when the data volume is really small, I go this route to speed up time-to-insights-lag.
  2. Write a parser script -- 10 lines of Python could do. The concept is easy but troubleshooting and handling corner cases may take some uncertain time. More importantly, we end up building many one-off scripts that are never re-used.

With ChatGPT, we can finish the data wrangling part in 5min. We can focus more of the core part, i.e. designing race plan in my case by tuning pace/ rest time/ etc.

In the past, either way of data wrangling is not ideal.

For method 1, I feel doing very fundamental work but the complexity & scale makes it very hard to delegate. One can finish those tasks with manual effort much faster than explaning the task to an assistant.

For method 2, I often enjoy writing concise Python scripts, just to keep my mind fresh (like people going to gym), but I know from bottom of my heart those scripts do not generalize well.

We can of course build a utility function for this task though:

import xml.etree.ElementTree as ET

def parse_xml_to_table(tree):
    # Find all "wpt" elements and extract the desired attributes and elements
    rows = []
    for wpt in tree.findall('wpt'):
        name = wpt.find('name').text
        lat = wpt.get('lat')
        lon = wpt.get('lon')
        ele_elem = wpt.find('ele')
        ele = ele_elem.text if ele_elem is not None else 'N/A'

        # Add the values to the list of rows
        rows.append([name, lat, lon, ele])

    # Return the resulting table (as a list of rows)
    return rows


def xml_to_table(filename):
    # Read in the XML file and parse it into an ElementTree object
    tree = ET.parse(filename)

    # Call the parse_xml_to_table function to extract the desired information
    table = parse_xml_to_table(tree)

    # Return the resulting table (as a list of rows)
    return table

Example usage:

table = xml_to_table('input.xml')

# Iterate over the rows of the resulting table and print them to the console
for row in table:
    print('\t'.join(row))

However, it is unlikely for me to do similar tasks next time. The nice utility function is like the T-Shirt at bottom of my wardrobe -- you know it is useful "some time", but the time never come.

BTW, above two Python snippets are written by ChatGPT!

One may wonder if we shall ask ChatGPT to write the Python snippet directly, instead of just parsing the table?

I know writing Python script feels pretty advanced, but I still prefer to ask for table output on first iteration, for two reasons:

  1. I don't have to run a Python interpreter / troubleshoot and get the output. I can proceed to my core business sooner.
  2. In our example, the missing elevation, aka "N/A", is well handled in Python script, primarily because we have a bad data point in the sample. This inspires us to come up with input/ output examples before instructing ChatGPT to design something for scale. This also mimics the real work of a data scientist. We often don't know what is in the dataset before we delve into it. So we prefer interactive environments, e.g. Jupyter Notebook, so that we can explore and fix things on the fly.

Did you find this article valuable?

Support HU, Pili by becoming a sponsor. Any amount is appreciated!