Description
Issue:
The ndash characters in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character, here: "from 1913–74." and here: "near–bankruptcy".
To Recreate:
Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7, installed following the instructions from lecture 5: go to the directory where you cloned python-spark-tutorial and run the following from lecture 6:
spark-submit ./rdd/WordCount.py
The execution halts about halfway through the frequency counter with the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)
Spoiler: it's the dash. I'm not sure whether the Unicode en dash (U+2013) was intentional, so I'm posting.
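The failure can be reproduced outside Spark. This is a hypothetical snippet, not taken from WordCount.py: in Python 2, printing a unicode string to a non-UTF-8 terminal implicitly encodes it as ASCII, which fails on U+2013.

```python
# One of the two offending strings from word_count.txt.
word = u"from 1913\u201374."

try:
    # Roughly what Python 2's print does under an ASCII locale.
    word.encode("ascii")
except UnicodeEncodeError as err:
    print("Reproduced: %s" % err)
```

The en dash has no ASCII code point, so the 'ascii' codec raises exactly the error shown above.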
Work-Around:
I changed the two ndash characters to "from 1913-74." and "near-bankruptcy", which solved the issue for me. Related stackoverflow thread where someone else ran into a similar problem with Python 2.7 and used the same solution.