Ignore characters that we can't encode #14

Merged
merged 1 commit into from Jul 1, 2018
@hutcheon hutcheon commented Jul 1, 2018

Deduplicating digest sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, url https://static.xx.fbcdn.net/rsrc.php/v3icqO4/y_/l/it_IT/3d-bkgrd-16-2x.jpg
Deduplicating digest sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, url https://static.xx.fbcdn.net/rsrc.php/v3ikjC4/yE/l/it_IT/PEG.js
Deduplicating digest sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, url https://static.xx.fbcdn.net/rsrc.php/v3i3TB4/yJ/l/zh_CN/Photo.jpg
Deduplicating digest sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, url https://static.xx.fbcdn.net/rsrc.php/v3i_854/yI/l/zh_CN/%E6%A0%BC%E5%BC%8F%EF%BC%9A.PNG%E3%80%81.JPG%E3%80%81.JPEG
Traceback (most recent call last):
  File "dedupe.py", line 81, in <module>
    process(filename_in, filename_out)
  File "dedupe.py", line 71, in process
    writer.write_record(record)
  File "/data/data/projects/newsgrabber-731f176/warcio/warcwriter.py", line 325, in write_record
    self._write_warc_record(self.out, record)
  File "/data/data/projects/newsgrabber-731f176/warcio/warcwriter.py", line 225, in _write_warc_record
    self._set_header_buff(record)
  File "/data/data/projects/newsgrabber-731f176/warcio/warcwriter.py", line 217, in _set_header_buff
    headers_buff = record.http_headers.to_bytes(self.header_filter)
  File "/data/data/projects/newsgrabber-731f176/warcio/statusandheaders.py", line 148, in to_bytes
    return self.to_str(filter_func).encode('iso-8859-1') + b'\r\n'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 179: ordinal not in range(128)
Process DeduplicateWarcExtProc returned exit code 1 for Item newsbuddy:warrior_5_1530383124.14
Failed DeduplicateWarcExtProc for Item newsbuddy:warrior_5_1530383124.14
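
For illustration only, a minimal sketch of what "ignoring characters that we can't encode" could look like. The traceback suggests Python 2, where calling `.encode('iso-8859-1')` on a byte string first decodes it implicitly as ASCII, which is why an encode call raises `UnicodeDecodeError`. The helper `headers_to_bytes` below is hypothetical and written for Python 3; it is not the code merged in this pull request, and the assumption that raw header bytes are UTF-8 is mine.

```python
def headers_to_bytes(header_text, encoding="iso-8859-1"):
    """Hypothetical helper: serialize header text, dropping unencodable characters."""
    if isinstance(header_text, bytes):
        # Assumption: raw header bytes are UTF-8. Decode explicitly instead of
        # relying on an implicit ASCII decode, and skip any invalid sequences.
        header_text = header_text.decode("utf-8", errors="ignore")
    # errors="ignore" silently drops characters that have no representation
    # in the target encoding instead of raising an exception.
    return header_text.encode(encoding, errors="ignore") + b"\r\n"


if __name__ == "__main__":
    # U+683C U+5F0F (the CJK characters from the failing URL) are not in ISO-8859-1.
    print(headers_to_bytes("Location: /l/zh_CN/\u683c\u5f0f.PNG"))
```

Dropping characters is lossy: the written header no longer matches what the server sent, so this trades fidelity for not aborting the whole item.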

@km09 km09 merged commit 732f176 into ArchiveTeam:master Jul 1, 2018