Using Large Language Models To Clean Data

October 15, 2023 #california-traffic-data #generative-ai #large-language-models #machine-learning

A pencil drawing of a robot inspecting a car. Prompt: A simple color pencil drawing a robot, inspecting a car, holding a clipboard, white background.

I maintain the SWITRS-to-sqlite Python library that parses and cleans up California Highway Patrol’s traffic collision database. One of the fields the responding officer has to fill out at the scene of the crash is the make¹ of the vehicle. This field is a free text field, but there is a relatively small number of common brands, so it should be mapped to a categorical column.

This is straightforward when the officer writes FORD or HONDA, which they mostly do. But since the officer can write anything, they occasionally make it a little harder on us by abbreviating or mistyping, for example VOLX and DODDGE. And sometimes they make it impossible by writing -- or ______.

The solution is to go through, one by one, and create a mapping like:

# Enumeration of common vehicle makes
@unique
class Make(Enum):
  CHEVROLET  = "chevrolet"
  GMC        = "gmc"
  HINO       = "hino"
  INFINITI   = "infiniti"
  MITSUBISHI = "mitsubishi"
  # Special Token for unknown make
  NONE       = None

# Dictionary mapping raw values to Make enum
make_map = {
  "CHEVRLT":  Make.CHEVROLET,
  "HINO/":    Make.HINO,
  "INFINITY": Make.INFINITI,
  "MITSUB":   Make.MITSUBISHI,
  "TAHOE":    Make.GMC,
  "UKNOWN":   Make.NONE,
}

As someone who did this mapping by hand for over 900 entries, it is quite tedious. Fortunately, making sense of mangled text is something Large Language Models (LLMs) are pretty good at!

Automating

The goal is to perform few-shot, multi-label classification of vehicle makes. Few-shot because we are going to give the model just a handful of examples of what output we expect, and multi-label because there are many possible vehicle makes it will have to map to.

Prompting

The first step is to write a prompt explaining the task to the model, the expected return value, and a few examples of input and correct outputs. Here is a shortened version, the full one is here, starting with the instructions:

I am working with a dataset of traffic collisions from California. One of the fields is the “make” of the vehicle, for example, “Honda”, “Ford”, “Peterbilt”, etc.

But this field a free-text field filled out by the CHP officer on the scene of the collision. As such there are misspellings, abbreviations, and other mistakes that have to be fixed.

I have created a set of makes as follows (including NONE as a placeholder for unknown values). Here is the list in a Python Enum:
@unique
class Make(Enum):
    ACADIAN                 = "acadian"
    ACURA                   = "acura"
    ALFA_ROMERO             = "alfa romera"
    AMC                     = "american motors"
    ...
Take note that anything unknown should be tagged with Make.None. And do not make up new Enum values.

Then the output format, with instructions to include an explanation of its logic first, which can help model accuracy:

I will provide you with a string. You are to return a Python dictionary with the following keys, in this same order:

{
  explanation: "An explanation of why you think the enum value is a good match, or why there is no match possible.",
  input_string: "The input string",
  enum: "The correct enum from above",
  no_match: "`True` or `False`. True if there is no matching enum or no way to make a match, otherwise False.", 
}

And finally some examples of inputs and correct outputs:

For example, for the input VOLX:

{
  explanation: """VOLX is pronouced similarly to 'Volks' and therefore this is
    probably an abbreviation of 'Volkswagen'. There is an enum value for
    Volkswagon, `Make.VOLKSWAGEN`, already so we use that.""",
  input_string: "VOLX",
  enum: make.VOLKSWAGEN,
  no_match: False,
}

Answers

Since I was manually copying the prompt into the model’s web interface, I used batches of 100–200 string sorted alphabetically. With API access, I could have used retrieval-augmented generation to create custom examples for each string while sending them one at a time.

Splitting the data into batches helped the model figure out very short entries. For example, the model failed when given WNBG (Winnebago) by itself, but succeeded when I gave it the list:

WINN
WINNE   
WINNEBAG
WINNEBAGO
WINNI
WNBG 
WNBGO

I believe seeing multiple short versions next to eachother helped the model infer the right mapping.

Performance

I obtained the following performance on my 902 hand-mapped entries:

The model correctly fixed 2 entries that I had gotten wrong.
It matched 682 (75.6%) of my hand-labeled mappings.
It missed 218 (24.1%) of the mappings, frequently using made-up enum values.

This is reasonably good performance, as finding wrong entries is pretty quick (and many could be fixed with find and replace).

The “make” of a vehicle is the brand of the manfacturer, like ‘Honda’, ‘Ford’, ‘Tesla’, etc. ↩