Rescuing your data from Formstack’s “excess form submission” email

I seem to be accumulating all sorts of little scripts to solve odd little problems sensible people don’t encounter (and therefore don’t have scripts for). Normally I keep these to myself because they’re really ugly, but I’m going to start sharing the more useful ones on my blog. Maybe some day I’ll save someone a bit of time. That would be fun!

Using a Formstack form? Went over your form submission limit and now your email is flooded with thousands of useless error messages?

Yeah, that happened to us, too. (If it didn’t happen to you, just ignore me and carry on not being miserable).

Usually, Formstack stores submissions on its end (and they do a decent enough job of that), but after a point that stops being allowed and the service sends an email for each submission, instead. Thousands of really poorly formatted emails. I think they’re trying to sell something…

So, here is a Python script to turn all that Formstack email back into a single CSV file that you can actually use. The end result looks just like a CSV you would normally download from them, so you can get everything back to normal.

#!/usr/bin/python

# Formstack email scraper, by Dylan McCall, https://dylanmc.ca//-

import argparse

import codecs
import sys
import csv

# Fields that have arbitrary, user-entered text as values, so the parser knows
# to ignore reserved characters like "="
FREE_INPUT_FIELD_NAMES = [
	"Other",
	"Please specify",
	"Do you have any other comments about toxic ingredients in home cleaning products?"
]

def main():
	parser = argparse.ArgumentParser(description="Convert a CSV file full of Formstack overflow emails back to a useful format.")

	parser.add_argument('input', metavar='mail',
		help="The CSV file with Formstack email text in its \"Body\" column")
	parser.add_argument('output', metavar='output',
		help="A file to store the results in")

	parser.add_argument('--input-delimiter', default=',',
		help="Value separator used for input file. Escape sequences like \\t are allowed.")
	parser.add_argument('--headers', default='headers.csv',
		help="A CSV file from Formstack's Export Submissions feature. Only the top part (column names) will be used.")
	parser.add_argument('--unique-id-start', metavar="N", default=99900000, type=int,
		help="Starting point for unique submission IDs")

	args = parser.parse_args()
	args.input_delimiter = args.input_delimiter.decode('string_escape')

	try:
		headerFile = open(args.headers, 'r')
	except IOError as error:
		print "I couldn't open %s." % args.headers
		print "Please use Formstack's Export Submissions feature to download form"
		print "submissions as a CSV file. Save the file to the same place as this"
		print "script, and name it headers.csv."
		print "The file doesn't need any real submissions. I only need it for the"
		print "row of column names at the top."
		sys.exit()

	inFile = open(args.input, 'r')
	outFile = open(args.output, 'wb')

	submissionData = []

	# This bit of code really hinges on how your mail format works.
	# Right now, it's written for Microsoft Outlook's "Export messages as CSV"
	# feature, which has a Body column for message text
	mailReader = csv.DictReader(inFile, delimiter=args.input_delimiter, quotechar='"')
	for row in mailReader:
		mailBody = row['Body']
		data = getDataFromMail(mailBody)
		# Some messages may not have interesting data, so we ignore those
		if len(data) > 0:
			submissionData.append(data)

	# Field names can repeat, so we need to figure out the order they appear
	columns = []
	headerReader = csv.reader(headerFile, delimiter=',', quotechar='"')
	columns.extend(headerReader.next())

	dataWriter = csv.writer(outFile)
	dataWriter.writerow(columns) # Print column names at top of CSV file

	for index, data in enumerate(submissionData):
		# Populate the Unique ID field, which isn't in the message body
		data['Unique ID'] = args.unique_id_start + index

		outputRow = list(['']) * len(columns)
		for fieldID, fieldValue in data.iteritems():
			fieldParts = fieldID.split('###', 2)
			fieldName = fieldParts[0]
			fieldNumber = int(fieldParts[1]) if len(fieldParts) > 1 else 0

			# Find the [fieldNumber]th repetition of the column called fieldName
			columnRepeats = 0
			colNumber = findNth(columns, fieldName, fieldNumber)
			if colNumber >= 0:
				outputRow[colNumber] = fieldValue

		dataWriter.writerow(outputRow)

def findInList(lst, needle, start=0):
	try:
		position = lst.index(needle, start)
	except:
		position = -1
	return position

def findNth(lst, needle, n):
	needleRepeats = 0
	foundIndex = findInList(lst, needle, 0)
	while needleRepeats < n and foundIndex >= 0:
		needleRepeats += 1
		foundIndex = findInList(lst, needle, foundIndex+1)
	return foundIndex

'''
Extract all of the fields listed in the given message body.
@return dictionary of raw field data, mapping a field name+++number to its value
'''
def getDataFromMail(text):
	fields = {}

	IGNORE_COLUMNS = frozenset([
		"WARNING" #Formstack's warning message preceding information dump
	])

	lines = text.split('\n')

	currentKey = None
	currentValue = None

	for line in lines:
		if currentKey == None:
			# Looking for the start of a new field

			if line.find(': ') > -1:
				# Start of a new field.
				# This field can span many lines. Store currentKey so the next lines will be appended.
				# Record field name and first value (after ": ") if applicable
				key, val = line.split(': ', 1)
				key = key.strip()
				val = val.strip()

				if not key in IGNORE_COLUMNS:
					currentKey = key
					currentValue = list()
					if val:
						currentValue.append(val)

		elif currentKey and line.strip() == '':
			# Blank line always means the end of the current field.

			if len(currentValue) == 0:
				currentValue.append('')

			processedKeyValues = processField( (currentKey, currentValue) )

			for finalKey, finalValue in processedKeyValues:
				duplicates = 0
				uniqueFinalKey = "%s###%d" % (finalKey, duplicates)
				while uniqueFinalKey in fields:
					duplicates += 1
					uniqueFinalKey = "%s###%d" % (finalKey, duplicates)

				fields[uniqueFinalKey] = finalValue

			currentKey = None
			currentValue = None

		elif currentKey:
			currentValue.append(line.strip())

	return fields

def processField(field):
	fieldKey = field[0]
	fieldValue = field[1]

	orderedSubFieldKeys = list()
	subFields = dict()

	data = list()

	for value in fieldValue:
		# This behaves poorly for values that, themselves, contain " = "

		# Ugly hack for cases with user input
		if fieldKey in FREE_INPUT_FIELD_NAMES or value.startswith("Other:"):
			splitValue = [value]
		else:
			splitValue = value.split(' = ', 1)

		subFieldKey = ''
		subFieldValue = ''

		if len(splitValue) == 2:
			subFieldKey = splitValue[0]
			subFieldValue = splitValue[1]
		elif len(splitValue) == 1:
			subFieldValue = splitValue[0]

		if not subFieldKey in orderedSubFieldKeys:
			orderedSubFieldKeys.append(subFieldKey)
			subFields[subFieldKey] = list()
		subFields[subFieldKey].append(subFieldValue)

	for subFieldKey in orderedSubFieldKeys:
		if subFieldKey == '':
			completeHeader = fieldKey
		else:
			completeHeader = "%s (%s)" % (fieldKey, subFieldKey)

		subFieldValueList = subFields[subFieldKey]
		completeData = '|'.join(subFieldValueList)

		#Final output!
		data.append( (completeHeader, completeData) )

	return data

if __name__ == "__main__":
	main()

Using this is a little involved. But that’s why you’re still here, right?

First, put that script somewhere. Name it extract-data.py. If you haven’t guessed, you’ll need Python. I’m assuming you’re using Ubuntu, MacOS X or something of the sort, but Windows should work as well.

Now, you need two more files.

Go to the Form Submissions page for your form on Formstack. Click the button near the bottom, to export some submissions as a CSV file. I don’t really care how many submissions you export: this output is just being used to get a proper set of column headers. Save that CSV file as headers.csv and put it in the same place as our Python script.

Next, we need your email, ideally in something like a CSV file. In this case, I was working with a big CSV file from Microsoft Outlook. (We actually used tab-separated values, because Outlook was doing something nightmarish with character encodings. Somewhere between exporting as a TSV and opening the thing differently in LibreOffice, something started to cooperate). Mozilla Thunderbird has an extension to export a folder as a CSV file, too, and it probably won’t be trying to murder kittens.

Now, because there’s no real standard way to do a CSV-formatted heap of messages, you might need to edit the script to get it to read your messages. There’s a little bit of code at the top — basically everything involving mailReader — which you can edit. It just loops through each message in the CSV file and passes that message’s body text to the message scraper. (If you’re feeling creative, you could use an mbox file with Python’s mailbox library).

While you’re at it, you need to edit another variable. FREE_INPUT_FIELD_NAMES is a list of form fields where people can enter whatever they want. (Text fields, for example). You need to add every field that is like this on your form. Otherwise, the scraper will trip over peoples’ inputs.

…Okay, now we’re ready to go!

Open a terminal, go to the place with your script (and mail.csv, and headers.csv), and do this:

python extract-data.py mail.csv output.csv --headers=headers.csv

Boom! Magical, formatted data.

The resulting file, output.csv, should look something like this:

Of course, please be gentle and don’t take this as an excuse to misuse Formstack’s service. They probably wouldn’t be happy with that and they’re being very kind by processing submissions above the maximum at all. Especially, in our case, 1600 over the maximum :)

We used this to recover our data and, barring the hours of cursing and hair-pulling as I tried to make this scraper work, it was quick and painless. I hope it helps someone else out there, too.

2 Replies to “Rescuing your data from Formstack’s “excess form submission” email”

  1. Oh, I should mention why this is as long as it is (or, alternately, why it took so long to write given the length).

    In Formstack’s submission email, each field is listed in a “Key: Value” format, but the value can span multiple lines. Meanwhile, there are multiple choice questions, where each choice (in the value section) is meant to be “|” separated; matrix questions, where each row in the matrix gets its own column (all represented in the value section); and free-form inputs, where the value can be absolutely anything.

    Formstack’s email format crams those together with little to differentiate them, and it gets especially fiddly when we get to free-form input because people can happily write “= ” in those fields, even though that character is used as a delimiter between matrix column names and choices. It took a surprising amount of tuning to get all that madness back into the same format as their usual data export.

    (Also, I think using optparser is fun).

    If someone finds a simpler solution to scraping Formstack messages, I’ll be interested to see it. So, please share if you have one!

Comments are closed.