I would like to read a csv file using Python 3.5 and Pandas 0.18.1.
import pandas as pd
df = pd.read_csv(directory, skiprows=3)
However, I got the following error.
CParserError: Error tokenizing data. C error: EOF inside string starting at line 353996
When I checked the corresponding line in the csv file, I found that it is missing a closing double quotation mark. I inspected the file contents like this:
with open(directory) as f:
    content = f.readlines()
for line in content:
    print(line)
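With more than 300,000 lines, printing everything is not very practical. A quicker way to spot the broken line is a small sketch like the one below, assuming that a valid line always contains an even number of double quotation marks:

```python
# a sketch: flag lines whose number of double quotation marks is odd,
# which usually means a closing quote is missing somewhere
def find_unbalanced_quotes(lines):
    return [line_no for line_no, line in enumerate(lines)
            if line.count('"') % 2 != 0]

# toy example: line index 1 is missing its closing quote
sample = ['1,"ok"', '2,"broken', '3,"ok"']
print(find_unbalanced_quotes(sample))  # [1]
```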
Here is the problematic line.
320597,83222081,Ratna,2031,320597073,NEKSA (TO),O-Pre Sales,Active,30136,8242015,RAYON ZEUS 8,20170222,"""Jl.Kamboja No.29, Lrg.Perburuan"", Lr.Perburuan, 1671040010, 16
So after '16', the line should have ended with , ". Here is an example of another line that did not result in a CParserError:
320597,83222080,Kartika,2031,320597073,NEKSA (TO),O-Pre Sales,Active,30136,8242015,RAYON ZEUS 8,20170222,"""Jl.Dwikora 2 No.2924, Lrg.Tinta Mas"", Lr.Tinta Mas, 1671040010, 1671040, 1671,16, "
In order to fix this issue, we have to do a workaround. What I am currently thinking is to modify the problematic line by manually appending the missing characters. However, if you have a better solution, please do let me and my readers know.
First, we have to know the line number of the problematic line. In order to get it, we need to catch the exception. Afterwards, I fed the exception message (after converting it to str) to a new method called fix_and_read_csv.

try:
    df = pd.read_csv(directory)
except Exception as e:
    e = str(e)
    df = fix_and_read_csv(e, directory, append_chars=', "')
So what is this fix_and_read_csv? Basically, this method will:
- open the same .csv file that we have already opened before and that returned an error
- save it to a temporary read/write file
- open the same .csv file in write mode
- write the .csv file back from the beginning
- fix the problematic line by appending the missing characters at the end of the line
- close and save the .csv file.
Here is my fix_and_read_csv code.
def fix_and_read_csv(exception_msg, directory, append_chars):
    # step I: get the problematic line number; the last integer
    # token in the exception message is the line reported by the parser
    error_line = None
    for item in exception_msg.split():
        try:
            error_line = int(item)
        except ValueError:
            continue
    # step II: fix the problematic line
    modify_file(directory, error_line, append_chars)
    # step III: create the dataframe and return it
    df = pd.read_csv(directory)
    return df
Inside the fix_and_read_csv method, I embedded one more method to read and then write back the fixed line to the same file. I named it modify_file. Here is the modify_file code.

def modify_file(filename, error_line_at, append_chars):
    import tempfile
    # shift the reported line number so it matches the actual
    # line index in the file
    error_line_at = error_line_at + 6
    # create a temporary read/write file
    t = tempfile.NamedTemporaryFile(mode="r+")
    # copy the input file to the temporary file
    with open(filename, 'r') as i:
        for line in i:
            t.write(line)
    t.seek(0)  # rewind the temporary file to the beginning
    # overwrite the original file with the fixed contents
    with open(filename, "w") as o:
        for line_no, line in enumerate(t):
            if line_no != error_line_at:
                o.write(line)
            else:
                # strip the newline first so the missing characters
                # land on the same line, not on the next one
                o.write(line.rstrip('\n') + append_chars + '\n')
    t.close()  # closing the temporary file also deletes it
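To see the whole idea end to end, here is a minimal self-contained reproduction using an in-memory toy csv (hypothetical data, not my actual file): the last row is missing its closing quote, reading it raises the tokenizing error, and appending the missing quote makes the read succeed.

```python
import io
import pandas as pd

# hypothetical toy data: the quoted field on the last row is
# missing its closing double quotation mark
bad = 'a,b\n1,"x"\n2,"y"\n3,"z\n'

msg = None
try:
    pd.read_csv(io.StringIO(bad))
except Exception as e:
    msg = str(e)  # e.g. "... C error: EOF inside string ..."

# append the missing quote to the broken line and read again
fixed = bad.rstrip('\n') + '"\n'
df = pd.read_csv(io.StringIO(fixed))
print(msg is not None, len(df))
```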
Again, I am sorry I could not continue my previous post this time. The style of this post is also a bit different, because I just copied and pasted my question from stackoverflow.com. Nonetheless, I hope you still enjoy reading this week's post. As always, thanks for reading. Should there be any query or comment, please leave it down in the comment section below. See you guys in the next post!
In most cases, it might be an issue with:
- the delimiters in your data.
- the parser getting confused by the headers/columns of the file.
The "Error tokenizing data" error may arise when you are using a separator (e.g. a comma ',') as a delimiter and you have more separators than expected (more fields in the error row than defined in the header). So you need to either remove the additional field or remove the extra separator if it is there by mistake. The better solution is to investigate the offending file and fix it manually so you do not need to skip the error lines.
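The commenter's point about extra separators can be reproduced in a couple of lines, a sketch with hypothetical data: the header defines three fields, but the second data row contains four.

```python
import io
import pandas as pd

# hypothetical data: row 3 has one field too many for a 3-column header
data = "a,b,c\n1,2,3\n4,5,6,7\n"

msg = None
try:
    pd.read_csv(io.StringIO(data))
except Exception as e:
    msg = str(e)  # e.g. "... Expected 3 fields in line 3, saw 4"
print(msg is not None)
```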