In Ruby, how do I deal with non-UTF-8 characters in PDF content?

Dave

I’m using Rails 4.2.7. I’m downloading and writing PDF content from the web, like so …

    res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
      puts "launching #{uri}"
      resp = http.get(uri)
      status = resp.code
      content = resp.body
      content_type = resp['content-type']
      content_encoding = resp['content-encoding']
    end
…
    if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
      File.open(file_location, "w") { |file| file.write content }

For some content, I get the error below:

Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'

I tried to account for it by replacing invalid characters, like so …

    File.open(file_location, "w") { |file| file.write content }
    content.encode('UTF-8', :invalid => :replace, :undef => :replace)

but then I get the error

error: PDF malformed, expected 'endstream' but found 0 instead

when trying to read the PDF file. Does anyone know of a better way to deal with downloaded PDF docs that won’t corrupt them?

Aleksei Matiushkin

I think the easiest solution would be to write it as is, using IO.binwrite:

    File.binwrite(file_location, content)
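
For illustration, here is a minimal sketch of the binary write. The original error most likely comes from opening the file in text mode ("w"), where Ruby may try to transcode the binary response body to UTF-8. The sample bytes and the temporary path are hypothetical stand-ins for resp.body and file_location from the question.

```ruby
require 'tmpdir'

# Hypothetical stand-in for resp.body: a binary (ASCII-8BIT) string
# containing bytes that are not valid UTF-8.
content = "%PDF-1.4\n\xC2\xFF\x00stream bytes\n%%EOF\n".b

# Hypothetical path; in the question this would be file_location.
file_location = File.join(Dir.mktmpdir, 'sample.pdf')

# Binary write: the bytes go to disk without any transcoding.
File.binwrite(file_location, content)

# Equivalent form that keeps File.open but requests binary mode with "wb".
File.open(file_location, 'wb') { |file| file.write(content) }

# Read the bytes back in binary mode and compare with what was sent.
round_trip = File.binread(file_location)
puts round_trip == content
```

If the comparison holds, the file on disk is byte-for-byte identical to the downloaded body, which is what a PDF reader needs.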

The above might fail if the files you receive are in different encodings. In that case I would try:

    content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
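
To show what this conversion does, here is a small sketch on a hypothetical four-byte sample (the word "café" in ISO-8859-1). Every byte value maps to some ISO-8859-1 character, so relabelling and re-encoding does not raise, but bytes above 0x7F grow to two bytes in UTF-8.

```ruby
# Hypothetical sample: "café" as ISO-8859-1 bytes, tagged as binary.
raw = "caf\xE9".b

# force_encoding relabels the string in place, so work on a copy;
# encode then converts each ISO-8859-1 character to UTF-8.
text = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts text.valid_encoding?
puts raw.bytesize   # 4 bytes before conversion
puts text.bytesize  # 5 bytes after: the 0xE9 byte became two UTF-8 bytes
```

Because the byte count changes, this conversion suits text payloads. Binary data such as PDF streams depends on exact byte offsets, so the binary write remains the safer first choice for PDFs.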
