I have a UTF-8 encoded text file with one token per line, and I would like to read it into a vector. This is R version 3.0.1 on MS Windows. I understand that the default encoding is UTF-8, right?
I am looking for a code snippet like the ones at
http://www.mayin.org/ajayshah/KB/R/html/r4.html
from 'R by example':
http://www.mayin.org/ajayshah/KB/R/index.html
However, they only have ASCII examples, not a UTF-8 one.
You can either read it in with read.table() and then extract the column as a vector, or read it directly into a vector with scan():
vect <- scan(file="path/to/file1.txt", what=character(0) )
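Whether you need to specify the encoding depends on your locale. In a UTF-8 locale you would not, but R on Windows normally runs in the native code page rather than UTF-8, so there it is safer to state it explicitly. Note that the encoding argument of scan() only marks the strings it reads as UTF-8; it does not re-encode them: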
vect <- scan(file="path/to/file1.txt", what=character(0), encoding="UTF-8" )
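For a file with one token per line, a slightly fuller sketch might look like the following. The temporary file and the sample tokens are made up for illustration. Setting sep="\n" is my own addition, on the assumption that a token may contain spaces (by default scan() splits on any whitespace), and readLines() is shown as an alternative for line-oriented files.

```r
# Write a small UTF-8 file to read back (hypothetical sample data).
path <- tempfile(fileext = ".txt")
tokens <- c("caf\u00e9", "na\u00efve", "\u65e5\u672c\u8a9e")
con <- file(path, open = "wb")
writeLines(enc2utf8(tokens), con, useBytes = TRUE)  # raw UTF-8 bytes, "\n" line ends
close(con)

# sep = "\n" keeps each line as a single token even if it contains spaces.
# encoding = "UTF-8" marks the resulting strings as UTF-8 (no re-encoding).
vect <- scan(file = path, what = character(), sep = "\n",
             encoding = "UTF-8", quiet = TRUE)

# Alternative: readLines() returns one element per line.
vect2 <- readLines(path, encoding = "UTF-8")

unlink(path)
```

Both calls should give a character vector with one element per line of the file.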
The NEWS file for R 3.0.0 said:
" o readLines() and scan() (and hence read.table()) in a UTF-8 locale now discard a UTF-8 byte-order-mark (BOM). Such BOMs are allowed but not recommended by the Unicode Standard: however Microsoft applications can produce them and so they are sometimes found on websites.
The encoding name "UTF-8-BOM" for a connection will ensure that a UTF-8 BOM is discarded. "
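As a rough sketch of what that NEWS entry describes, you can open the file as a connection with the "UTF-8-BOM" encoding name and read from the connection. The temporary file and its contents are invented for the example, and the content is kept to ASCII so that the result does not depend on your locale. This assumes R 3.0.0 or later.

```r
# Build a small file that starts with a UTF-8 byte-order mark (hypothetical data).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)   # the UTF-8 BOM
writeBin(charToRaw("alpha\nbeta\n"), con)    # two tokens, one per line
close(con)

# Opening the connection with encoding = "UTF-8-BOM" discards the BOM,
# so the first token does not start with stray bytes.
con <- file(path, open = "r", encoding = "UTF-8-BOM")
vect <- readLines(con)
close(con)

unlink(path)
```

Without the BOM handling, the first element could come back with the three BOM bytes in front of "alpha".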
So perhaps the need for the encoding argument indicates either that you are in a non-UTF-8 locale and didn't tell us, or that you are using an outdated R version?
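If it helps to settle that, something along these lines should show what your session is actually using. The variable names are only for illustration.

```r
# Report the R version and the character-set part of the current locale.
r_version <- R.version.string
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the session is running in a UTF-8 locale.
is_utf8 <- l10n_info()[["UTF-8"]]

cat(r_version, "\n")
cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```

If the last line prints FALSE, you are not in a UTF-8 locale and should give the encoding explicitly when reading the file.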