Spark on an HDInsight cluster: UTF-8 text gives the encoding error 'ascii' codec can't encode characters in position: ordinal not in range(128)


가이 베 텐탈

I am trying to work with a UTF-8 TSV file containing Hebrew characters on a Linux-based HDInsight cluster using Spark, and I get an encoding error. Any suggestions?

Here is my pyspark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x:x !=header) .map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))

# Infer the schema and create a table       
transactionsTable = sqlContext.createDataFrame(transactions)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
  print(name)

Error:

'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)

Hebrew text file contents:
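
For reference, the same message can be reproduced outside Spark. The sketch below is only an illustration: the helper name ascii_error is hypothetical, and the second case (UTF-8 bytes turned back into text one byte per character) is an assumption about how a 6-character range such as 6-11 could arise for a 3-letter name, not something the traceback confirms.

```python
# The Hebrew name from the sample file, written with escapes so this file stays ASCII.
name = u"\u05d2\u05d9\u05d0"


def ascii_error(text):
    """Return the error message raised when text is encoded as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# Three Hebrew characters after the 6-character "name: " prefix.
plain = u"name: " + name
print(ascii_error(plain))

# Assumption: the UTF-8 bytes were mapped back to text one byte per character,
# which turns the 3 letters into 6 non-ASCII characters.
mangled = u"name: " + name.encode("utf-8").decode("latin-1")
print(ascii_error(mangled))

# Pure ASCII text encodes without any error.
print(ascii_error(u"name: guy"))
```

The English file never hits this path because every character in it is already within the ASCII range.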

id  name    age 
1   גיא 37
2   maor    32 
3   danny   55

If I try with an English file, it works fine.

English text file contents:

id  name    age
1   guy     37
2   maor    32
3   danny   55

Output:

name: guy
name: maor
name: danny

aggFTW

If you run the following code with the Hebrew text:

from pyspark.sql import *

path = "/people.txt"
transactionsText = sc.textFile(path)

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x:x !=header).map(lambda l: l.split("\t"))

transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))

transactions.collect()

You'll notice that you get the names back as a list of unicode values:

[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32 '), Row(id=u'3', name=u'danny', age=u'55')]
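
Note the trailing space in age=u'32 ' above. A minimal sketch of a parsing helper that keeps the text as unicode, strips stray whitespace, and converts the age to an integer could look like the following. The name parse_line is hypothetical, the namedtuple only stands in for pyspark.sql.Row so the sketch runs without a cluster, and it assumes the fields really are tab-separated.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so this sketch runs without Spark.
Entry = namedtuple("Entry", ["id", "name", "age"])


def parse_line(line):
    """Split one tab-separated line into fields, keeping text as unicode."""
    fields = [field.strip() for field in line.split(u"\t")]
    # No .encode() and no str() here: the name stays a unicode string.
    return Entry(fields[0], fields[1], int(fields[2]))


# Sample lines matching the Hebrew file, without the header row.
lines = [u"1\t\u05d2\u05d9\u05d0\t37", u"2\tmaor\t32 ", u"3\tdanny\t55"]
rows = [parse_line(line) for line in lines]
print(rows)
```

In the notebook, a function like this could presumably be passed to map in place of the two lambdas, after the header has been filtered out.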

Now, we register a table with the transactions RDD:

table_name = "transactionsTempTable"

# Infer the schema and create a table       
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))

results.collect()

You'll notice that all strings in the Pyspark DataFrame returned from sqlContext.sql(...) are of the Python unicode type:

[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]

Now, running:

%%sql
SELECT * FROM transactionsTempTable

will give the expected result:

name: גיא
name: maor
name: danny

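If you want to do operations on those names, you'll probably want to work with unicode strings. From this article:

When you're dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode strings, as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O (reading to and from the disk, printing to a terminal, sending something over a network link, etc.) you should be dealing with byte str, as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.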
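
As a small illustration of that distinction, here is a sketch using the Hebrew name from the sample file. The variable names are only for illustration; the point is that character-level operations belong on the unicode string, while encoding to UTF-8 bytes is left to the moment the text is written out.

```python
# The Hebrew name from the sample file as a unicode string.
name = u"\u05d2\u05d9\u05d0"

# Text operations see three characters.
char_count = len(name)

# The UTF-8 encoding of the same name takes two bytes per letter.
encoded = name.encode("utf-8")
byte_count = len(encoded)

# Decoding the bytes with the same codec gives the original text back.
round_trip = encoded.decode("utf-8")

# Build the label as text first, and encode only at the output boundary.
label_bytes = (u"name: " + name).encode("utf-8")

print(char_count, byte_count, round_trip == name, len(label_bytes))
```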


Edited on 2021-06-9
