I find ChatGPT very good for data wrangling works that make up 80% of a data scientist's daily work.
Context
I have a script to make a race plan given a GPX file and list of check points (example output of Lantau 70). The GPX file can be found from race organiser website.
Due to technology lagging, many organizer's today don't have check points embedded in the GPX file. Towards this end, my initial design is to allow user to input a list of check points in form of, 1) cummulative distance from start; 2) Or latitude and longitude tuple.
I got request from an ultra runner today to make a race plan for TNF 100 Mo Gan Shan. The official GPX file has check points ("waypoints", aka wpt
) embedded. GPX file is an XML file in plain text format, so we can extract the information with any text editor and input in my planner spreadsheet.
Work with ChatGPT
GPX file overview
Following is the overview of a GPX file in case the reader did not see one before:
- Part 1 is the list of way points / check points (usually designated by organiser from a map)
- Part 2 is the list of track points (usually recorded from a sports watch/ or designed from a mapping software)
<?xml version="1.0" encoding="utf-8"?>
<gpx
version="1.0"
xmlns="http://www.topografix.com/GPX/1/0"
creator="ovitalmap - http://www.ovital.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd">
<time>2013-03-15T00:00:00Z</time>
<wpt lon="119.89916200" lat="30.60791600" >
<ele>53</ele>
<name>起终点 郡安里度假酒店</name>
</wpt>
<wpt lon="119.89111824" lat="30.53001315" >
<ele>64</ele>
<name>CP1 对河口水库</name>
</wpt>
... other way points omitted
<trk>
<name>100km</name>
<trkseg>
<trkpt lon="119.89916200" lat="30.60791600" >
<ele>53</ele>
<time>1970-01-01T07:59:59Z</time>
</trkpt>
<trkpt lon="119.89952100" lat="30.60759700" >
<ele>57</ele>
<time>1970-01-01T07:59:59Z</time>
</trkpt>
<trkpt lon="119.89949000" lat="30.60755500" >
<ele>57</ele>
... other track point ommitted
<ele>53</ele>
<time>1970-01-01T07:59:59Z</time>
</trkpt>
<trkpt lon="119.89919300" lat="30.60792400" >
<ele>53</ele>
<time>1970-01-01T07:59:59Z</time>
</trkpt>
<trkpt lon="119.89919300" lat="30.60792900" >
<ele>53</ele>
<time>1970-01-01T07:59:59Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
Instruction to ChatGPT
I give ChatGPT the following instructions:
read the following XML file, and convert to a tab separated table with name, lat, lon, ele
------
<wpt lon="119.89916200" lat="30.60791600" >
<ele>53</ele>
<name>起终点 郡安里度假酒店</name>
</wpt>
<wpt lon="119.89111824" lat="30.53001315" >
<ele>64</ele>
<name>CP1 对河口水库</name>
</wpt>
<wpt lon="119.87026124" lat="30.49539384" >
<ele>120</ele>
<name>CP2-西岑坞</name>
</wpt>
<wpt lon="119.80487134" lat="30.50543799" >
<ele>533</ele>
<name>CP3-王位山</name>
</wpt>
<wpt lon="119.78959181" lat="30.52616307" >
<ele>196</ele>
<name>CP4-下郎村</name>
</wpt>
<wpt lon="119.80046350" lat="30.53920958" >
<ele>124</ele>
<name>CP5-清逸山居</name>
</wpt>
<wpt lon="119.80986392" lat="30.53757413" >
<name>CP6-东沈村</name>
</wpt>
<wpt lon="119.80834456" lat="30.60588694" >
<ele>178</ele>
<name>CP7-瑶坞</name>
</wpt>
<wpt lon="119.82839453" lat="30.62405749" >
<ele>636</ele>
<name>CP8-大洋里</name>
</wpt>
<wpt lon="119.86925290" lat="30.63074462" >
<ele>132</ele>
<name>CP9-大树下</name>
</wpt>
Response
name | lat | lon | ele |
起终点 郡安里度假酒店 | 30.60791600 | 119.89916200 | 53 |
CP1 对河口水库 | 30.53001315 | 119.89111824 | 64 |
CP2-西岑坞 | 30.49539384 | 119.87026124 | 120 |
CP3-王位山 | 30.50543799 | 119.80487134 | 533 |
CP4-下郎村 | 30.52616307 | 119.78959181 | 196 |
CP5-清逸山居 | 30.53920958 | 119.80046350 | 124 |
CP6-东沈村 | 30.53757413 | 119.80986392 | N/A |
CP7-瑶坞 | 30.60588694 | 119.80834456 | 178 |
CP8-大洋里 | 30.62405749 | 119.82839453 | 636 |
CP9-大树下 | 30.63074462 | 119.86925290 | 132 |
To convert this XML file to a tab-separated table with columns "name," "lat," "lon," and "ele," we simply need to extract the corresponding attributes and values for each "wpt" element and populate the table accordingly. For each "wpt" element, the "name" attribute corresponds to the first column, the "lat" attribute corresponds to the second column, the "lon" attribute corresponds to the third column, and the "ele" element corresponds to the fourth column.
Note that there is one "wpt" element with a missing "ele" element, which we denote as "N/A."
Notes of response
- ChatGPT masters markdown very well, so the output table is well formatted. The table can be rendered in the Chat UI, or be copied into Google Sheets directly.
- It even proactively pointed out that one way point has no elevation!
Reflection
About data wrangling: (ref)
Data wrangling is the process of converting raw data into a usable form. It may also be called data munging or data remediation.
In contrast to the public's image of working on fancy models and visualizations everyday, many data scientists spend a majority of their time to transform data into usable format.
In our case today, I have two approaches in the past:
- Manually copy and paste from XML to Google Sheets -- If I deem this to be one off work, or when the data volume is really small, I go this route to speed up time-to-insights-lag.
- Write a parser script -- 10 lines of Python could do. The concept is easy but troubleshooting and handling corner cases may take some uncertain time. More importantly, we end up building many one-off scripts that are never re-used.
With ChatGPT, we can finish the data wrangling part in 5min. We can focus more of the core part, i.e. designing race plan in my case by tuning pace/ rest time/ etc.
In the past, either way of data wrangling is not ideal.
For method 1, I feel doing very fundamental work but the complexity & scale makes it very hard to delegate. One can finish those tasks with manual effort much faster than explaning the task to an assistant.
For method 2, I often enjoy writing concise Python scripts, just to keep my mind fresh (like people going to gym), but I know from bottom of my heart those scripts do not generalize well.
We can of course build a utility function for this task though:
import xml.etree.ElementTree as ET
def parse_xml_to_table(tree):
# Find all "wpt" elements and extract the desired attributes and elements
rows = []
for wpt in tree.findall('wpt'):
name = wpt.find('name').text
lat = wpt.get('lat')
lon = wpt.get('lon')
ele_elem = wpt.find('ele')
ele = ele_elem.text if ele_elem is not None else 'N/A'
# Add the values to the list of rows
rows.append([name, lat, lon, ele])
# Return the resulting table (as a list of rows)
return rows
def xml_to_table(filename):
# Read in the XML file and parse it into an ElementTree object
tree = ET.parse(filename)
# Call the parse_xml_to_table function to extract the desired information
table = parse_xml_to_table(tree)
# Return the resulting table (as a list of rows)
return table
Example usage:
table = xml_to_table('input.xml')
# Iterate over the rows of the resulting table and print them to the console
for row in table:
print('\t'.join(row))
However, it is unlikely for me to do similar tasks next time. The nice utility function is like the T-Shirt at bottom of my wardrobe -- you know it is useful "some time", but the time never come.
BTW, above two Python snippets are written by ChatGPT!
One may wonder if we shall ask ChatGPT to write the Python snippet directly, instead of just parsing the table?
I know writing Python script feels pretty advanced, but I still prefer to ask for table output on first iteration, for two reasons:
- I don't have to run a Python interpreter / troubleshoot and get the output. I can proceed to my core business sooner.
- In our example, the missing elevation, aka "N/A", is well handled in Python script, primarily because we have a bad data point in the sample. This inspires us to come up with input/ output examples before instructing ChatGPT to design something for scale. This also mimics the real work of a data scientist. We often don't know what is in the dataset before we delve into it. So we prefer interactive environments, e.g. Jupyter Notebook, so that we can explore and fix things on the fly.