How do I set the character set for Spark's saveAsTextFile?


MarsJ - 大数据玩家~DS answered on 2016-06-23

saveAsTextFile actually uses Hadoop's Text class, whose encoding is hard-coded to UTF-8. Take a look at the source code below:
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}
And Text.java looks like this:
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
      new ThreadLocal<CharsetEncoder>() {
        protected CharsetEncoder initialValue() {
          return Charset.forName("UTF-8").newEncoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
      new ThreadLocal<CharsetDecoder>() {
        protected CharsetDecoder initialValue() {
          return Charset.forName("UTF-8").newDecoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };
If you want a different charset, e.g. UTF-16, you probably need to use saveAsHadoopFile together with org.apache.hadoop.io.BytesWritable:

saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)

and do the encoding yourself via getBytes("UTF-16").
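Putting those two pieces together, here is a minimal sketch of what that could look like (the SparkContext setup, the sample data, and the output path /tmp/utf16-out are all placeholders of mine, not part of the answer above):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: sc and rdd stand in for your own context and data.
val sc = new SparkContext(new SparkConf().setAppName("utf16-save"))
val rdd = sc.parallelize(Seq("第一行", "第二行"))

// Encode each line to UTF-16 ourselves, since Hadoop's Text would force UTF-8.
// Note the result is a Hadoop SequenceFile whose values hold the raw UTF-16
// bytes, not a plain text file.
rdd.map(line => (NullWritable.get(), new BytesWritable(line.getBytes("UTF-16"))))
  .saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]]("/tmp/utf16-out")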
 
Not sure whether this is the solution you need; for reference only.

Siyuan Ding answered on 2016-06-27

Thanks!
