Friday, February 24, 2017

Pandas Workaround for CParserError: Error tokenizing data. C error: EOF inside string starting at line

This week's post is a bit different from my previous ones. I originally wrote it as a question for stackoverflow.com. Before posting, however, I made sure that I had already tried every option that might solve my issue, and it turned out that my last attempt worked! So I decided not to post it on stackoverflow.com after all. Nevertheless, since I thought I might need to look at this workaround for the CParserError in Pandas again, I decided to post it on my blog.

I would like to read a csv file using Python 3.5 and Pandas 0.18.1.
import pandas as pd
df = pd.read_csv(directory, skiprows=3)
However, I got the following error.
CParserError: Error tokenizing data. C error: EOF inside string starting at line 353996


When I checked the csv file at the corresponding line, I found that this line was missing a double quotation mark.
with open(directory) as f:
    content = f.readlines()  # the with statement closes the file for us

# inspect the lines around the one reported in the error message
for line in content[353990:354000]:
    print(line)
Here is the problematic line.
320597,83222081,Ratna,2031,320597073,NEKSA (TO),O-Pre Sales,Active,30136,8242015,RAYON ZEUS 8,20170222,"""Jl.Kamboja No.29, Lrg.Perburuan"", Lr.Perburuan, 1671040010, 16
So after '16' the line should have ended with , ". Here is an example of another line that did not cause a CParserError.
320597,83222080,Kartika,2031,320597073,NEKSA (TO),O-Pre Sales,Active,30136,8242015,RAYON ZEUS 8,20170222,"""Jl.Dwikora 2 No.2924, Lrg.Tinta Mas"", Lr.Tinta Mas, 1671040010, 1671040, 1671,16, "
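For readers who want to spot such lines programmatically, here is a minimal sketch (a helper of my own, not part of the original workaround) that flags lines containing an odd number of double quotes, which is exactly how the broken line above differs from the healthy one:

def find_unbalanced_quote_lines(path):
    # an odd number of double quotes means an unterminated quoted field
    bad_lines = []
    with open(path) as f:
        for lineno, line in enumerate(f):
            if line.count('"') % 2 != 0:
                bad_lines.append(lineno)
    return bad_lines

Running this on the file above should report the line starting with 320597,83222081.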
In order to fix this issue, we have to do a workaround. My current idea is to modify the problematic line by manually appending the missing characters. However, if you have a better solution, please do let me and my readers know.
First, we have to know the line number of the problematic line. In order to get it, we need to catch the exception. Afterwards, I feed the exception message (after converting it to str) to a new method called fix_and_read_csv.
try:
    df = pd.read_csv(directory, skiprows=3)
except Exception as e:
    df = fix_and_read_csv(str(e), directory, append_chars=', "')
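As a side note, instead of catching the broad Exception, you could catch the specific parser error. The import path below is my assumption for pandas 0.18.x; later versions expose it as pandas.errors.ParserError.

from pandas.io.common import CParserError  # assumed location in pandas 0.18.x

try:
    df = pd.read_csv(directory, skiprows=3)
except CParserError as e:
    df = fix_and_read_csv(str(e), directory, append_chars=', "')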
So what is this fix_and_read_csv? Basically, this method will:
  1. open the same .csv file that we opened before and that raised the error
  2. save its contents to a temporary read/write file
  3. reopen the same .csv file in write mode
  4. rewrite the .csv file from the beginning
  5. fix the problematic line by appending the missing characters at the end of the line
  6. close and save the .csv file.

Here is my fix_and_read_csv code.
def fix_and_read_csv(exception_msg, directory, append_chars):
    # step I: get the problematic line number; the message ends with
    # "... starting at line <N>", so the last integer token is the one we want
    error_line = None
    for item in exception_msg.split():
        try:
            error_line = int(item)
        except ValueError:
            continue

    # step II: fix the problematic line
    modify_file(directory, error_line, append_chars)

    # step III: create the dataframe and return it
    df = pd.read_csv(directory, skiprows=3)
    return df
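Splitting the message on whitespace works here because the line number happens to be the last integer in it. A slightly more robust variation (my own sketch, keyed to the wording of the error message) would use a regular expression:

import re

def parse_error_line(exception_msg):
    # assumes the message contains "... starting at line <N>"
    match = re.search(r'line\s+(\d+)', exception_msg)
    return int(match.group(1)) if match else None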
Inside the fix_and_read_csv method, I call one more method to read the file and then write the fixed line back to the same file. I named it modify_file. Here is the modify_file code.
def modify_file(filename, error_line_at, append_chars):
    import tempfile

    # offset between the parser's reported line number and the actual
    # line in the file; 6 was found empirically for this file (it depends
    # on skiprows and the header)
    error_line_at = error_line_at + 6

    # create temporary file read/write
    t = tempfile.NamedTemporaryFile(mode="r+")

    # open input file read-only
    i = open(filename, 'r')

    # copy input file to temporary file
    for line in i:
        t.write(line)

    i.close()  # close input file

    t.seek(0)  # rewind temporary file to beginning

    o = open(filename, "w")  # reopen input file writable

    # overwrite original file with temporary file contents,
    # appending the missing characters to the problematic line
    for lineno, line in enumerate(t):
        if lineno != error_line_at:
            o.write(line)
        else:
            # strip the newline first so append_chars lands at the end
            # of the line instead of on a new line
            o.write(line.rstrip('\n') + append_chars + '\n')

    o.close()  # close and save the fixed file
    t.close()  # close temporary file, which deletes it
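For a file that fits comfortably in memory, the same fix can be written more compactly. This is a hypothetical equivalent of modify_file, not the code I actually ran:

def modify_file_simple(filename, error_line_at, append_chars, offset=6):
    # same empirical offset between the reported and the actual line number
    with open(filename) as f:
        lines = f.readlines()
    idx = error_line_at + offset
    lines[idx] = lines[idx].rstrip('\n') + append_chars + '\n'
    with open(filename, 'w') as f:
        f.writelines(lines)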
Again, I am sorry I could not continue my previous post this time. The style of this post is also a bit different, because I mostly copied and pasted my unposted stackoverflow.com question here. Nonetheless, I hope that you still enjoyed reading this week's post. As always, thanks for reading. Should there be any query or comment, please leave it in the comment section below. See you guys in the next post!

4 comments:

  1. In most cases, it might be an issue with:

    the delimiters in your data, or
    the parser being confused by the headers/columns of the file.

    The error tokenizing data may arise when you're using a separator (e.g. a comma ',') as a delimiter and the error row has more separators than expected (more fields than defined in the header). So you need to either remove the additional field or remove the extra separator if it's there by mistake. The better solution is to investigate the offending file and fix it manually so you don't need to skip the error lines.
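    For reference, skipping the offending rows (the option this comment advises against) would look something like this in pandas 0.18; error_bad_lines was later replaced by the on_bad_lines parameter:

    df = pd.read_csv(directory, error_bad_lines=False, warn_bad_lines=True)  # drops rows with too many fields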
