How to read a UTF-8 encoded list of string tokens into a vector?


z--

I have a UTF-8 encoded text file with one token per line, and I would like to read it into a vector. This is on MS Windows, with R version 3.0.1. I understand that the default encoding is UTF-8, right?

I am looking for a code snippet like the ones on

http://www.mayin.org/ajayshah/KB/R/html/r4.html

from 'R by example'

http://www.mayin.org/ajayshah/KB/R/index.html

However they do not have a UTF-8 example, only ASCII.

IRTFM

You can either read it in with read.table() and then extract the column as a vector, or with scan().

vect <- scan(file = "path/to/file1.txt", what = character(0))

If R is running in a UTF-8 locale, you do not need to specify the encoding. On Windows the native encoding is usually not UTF-8, though, so it is safer to state it explicitly:

vect <- scan(file = "path/to/file1.txt", what = character(0), encoding = "UTF-8")
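To make both routes concrete, here is a minimal, self-contained sketch. The tokens and the temporary file are made up for illustration; sep = "\n" and quote = "" are my additions, assuming each line should become exactly one element even when a token contains spaces or quote characters.

```r
# Write a small UTF-8 test file so the example is self-contained.
# The tokens and the temporary path are hypothetical.
tokens <- c("na\u00efve", "caf\u00e9", "\u65e5\u672c\u8a9e")
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(tokens), con, useBytes = TRUE)  # write the UTF-8 bytes unchanged
close(con)

# scan(): sep = "\n" gives one element per line; quote = "" keeps
# quote characters literal; encoding marks the strings as UTF-8.
vect <- scan(path, what = character(), sep = "\n", quote = "",
             encoding = "UTF-8", quiet = TRUE)

# read.table(): read a single column, then extract it as a vector.
df <- read.table(path, sep = "\n", quote = "", comment.char = "",
                 encoding = "UTF-8", stringsAsFactors = FALSE)
vect2 <- df[[1]]
```

Note that encoding= only marks the strings as UTF-8 and does not convert anything. If you want the input re-encoded into the native encoding instead, fileEncoding= is the argument to look at.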

The NEWS file for R 3.0.0 said:

" o readLines() and scan() (and hence read.table()) in a UTF-8 locale now discard a UTF-8 byte-order-mark (BOM). Such BOMs are allowed but not recommended by the Unicode Standard: however Microsoft applications can produce them and so they are sometimes found on websites.

The encoding name "UTF-8-BOM" for a connection will ensure that a UTF-8 BOM is discarded. "
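A rough sketch of the "UTF-8-BOM" connection encoding mentioned in that NEWS entry, assuming R 3.0.0 or later. The file contents are invented, and ASCII tokens are used so the example only exercises the BOM handling:

```r
# Build a test file that starts with a UTF-8 byte-order mark (EF BB BF).
# The path and the tokens are hypothetical.
path <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("alpha\nbeta\n")), path)

# Open a connection whose declared encoding tells R to discard the BOM.
con <- file(path, open = "r", encoding = "UTF-8-BOM")
vect <- readLines(con)
close(con)
```

scan() accepts a connection as well, so the same file(..., encoding = "UTF-8-BOM") connection can be passed to it in place of a file name.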

So perhaps the need for the encoding argument indicates either that you were in a non-UTF-8 locale and didn't tell us, or that you were using an outdated R version?
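Both possibilities can be checked from the R prompt. This is only a quick diagnostic sketch; the variable names are mine:

```r
# Is the current session running in a UTF-8 locale?
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

# Is this R at least 3.0.0, the release whose NEWS entry is quoted above?
has_bom_handling <- getRversion() >= "3.0.0"

cat("R version:    ", R.version.string, "\n")
cat("LC_CTYPE:     ", Sys.getlocale("LC_CTYPE"), "\n")
cat("UTF-8 locale: ", is_utf8, "\n")
cat("R >= 3.0.0:   ", has_bom_handling, "\n")
```

If is_utf8 comes back FALSE, passing the encoding explicitly is the safer choice.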


Edited at 2021-06-20
