Thursday 21 June 2012

Byte streams vs Character streams in Java

Byte streams are designed to handle "raw" binary data (image files, MP3s, and so on) from a file or another stream. A byte stream accesses the data byte by byte. It can read any kind of file, but it is not well suited to text. For example, if a file uses an encoding in which a character is represented by two bytes, the byte stream hands you those bytes separately and you have to do the conversion yourself.
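As a rough illustration of the point above, here is a minimal sketch that reads a UTF-8 encoded "é" through a byte stream. The class name ByteStreamDemo and the helper readAllBytes are made up for this example, and an in-memory ByteArrayInputStream stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteStreamDemo {

    // Reads every byte from the stream and returns the unsigned values (0-255).
    static List<Integer> readAllBytes(InputStream in) throws IOException {
        List<Integer> values = new ArrayList<>();
        int b;
        while ((b = in.read()) != -1) {
            values.add(b);
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // "é" (U+00E9) is one character but occupies two bytes in UTF-8.
        byte[] encoded = "\u00e9".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(encoded)) {
            System.out.println(readAllBytes(in)); // prints [195, 169]
        }
    }
}
```

The stream has no idea that the two values belong together; turning them back into "é" is left to the caller.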


A character stream reads a file character by character. Character streams work with Java's 16-bit char values, whereas byte streams work with 8-bit bytes. A character stream translates between encoded bytes and 16-bit characters for you, and it can support any character encoding the platform provides: ASCII, ISO-8859-1, UTF-8, UTF-16, and so on. A byte stream does no decoding at all, so treating each byte as a character only works for single-byte encodings such as ASCII. The Java platform stores character values using Unicode conventions. Character stream I/O automatically translates this internal format to and from the local character set.
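A small sketch of that translation, assuming the input is UTF-8: an InputStreamReader wrapped around a byte stream does the decoding. The class CharStreamDemo and its decode helper are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharStreamDemo {

    // Decodes the bytes with the given charset and returns the characters read.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // two bytes, one character ("é")
        String text = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(text.length()); // prints 1
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which can differ from machine to machine.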


Unless you are working with binary data, such as image and sound files, you should use readers and writers (character streams) to read and write information for the following reasons:

  • They can handle any character in the Unicode character set (a byte stream on its own only gives you 8-bit values, which cover at most a single-byte character set such as ISO Latin-1).
  • They are easier to internationalize because they are not dependent upon a specific character encoding.
  • Many of them use buffering techniques internally and are therefore potentially more efficient than unbuffered byte streams.

If a byte stream is used to read or write a text file, the programmer is responsible for converting bytes to characters correctly, and there is not always a one-to-one correspondence between a byte and a character. A byte stream delivers only 8 bits at a time, but an encoded character may occupy more: in UTF-16 most characters take 16 bits, and in UTF-8 a character takes between one and four bytes. For UTF-16, for instance, the programmer has to read two bytes from the stream and combine them into one Unicode character. It is also the programmer's responsibility to deal with line breaks correctly.
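To make that manual work concrete, here is a sketch of what hand-rolled decoding might look like. It assumes the input is big-endian UTF-16 with no byte-order mark; ManualDecodeDemo and decodeUtf16Be are hypothetical names, and real code would normally use an InputStreamReader instead.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ManualDecodeDemo {

    // Combines each pair of bytes (high byte first) into one 16-bit char.
    // Assumption: the input is UTF-16BE without a byte-order mark.
    static String decodeUtf16Be(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int high;
        while ((high = in.read()) != -1) {
            int low = in.read();
            if (low == -1) {
                throw new EOFException("Odd number of bytes: truncated UTF-16 input");
            }
            text.append((char) ((high << 8) | low));
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "Hi \u20ac".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(raw.length); // prints 8 (four chars, two bytes each)
        String text = decodeUtf16Be(new ByteArrayInputStream(raw));
        System.out.println(text.equals("Hi \u20ac")); // prints true
    }
}
```

Even this small helper only covers one encoding and one byte order, which is the reason to prefer a Reader for text.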


Character streams support line-oriented I/O. Character I/O usually occurs in bigger units than single characters. One common unit is the line: a string of characters with a line terminator at the end. A line terminator can be a carriage-return/line-feed sequence ("\r\n"), a single carriage-return ("\r"), or a single line-feed ("\n"). Supporting all possible line terminators allows programs to read text files created on any of the widely used operating systems.
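A short sketch of line-oriented reading with BufferedReader.readLine(), which accepts all three terminators and strips them from the returned line. The class LineReaderDemo and the helper readLines are illustrative names, and a StringReader stands in for a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineReaderDemo {

    // Reads the source line by line; readLine() returns null at end of input.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixes Windows (\r\n), old Mac (\r), and Unix (\n) line endings.
        String mixed = "one\r\ntwo\rthree\nfour";
        System.out.println(readLines(new StringReader(mixed))); // prints [one, two, three, four]
    }
}
```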


All byte stream classes are descended from java.io.InputStream and java.io.OutputStream. Character streams are implemented by the java.io.Reader and java.io.Writer classes and their subclasses.  

The Stream Classes


Most of the classes that work directly with streams are part of the java.io package. The two main classes are java.io.InputStream and java.io.OutputStream. These are abstract base classes for many different subclasses with more specialized abilities, including:

BufferedInputStream BufferedOutputStream
ByteArrayInputStream ByteArrayOutputStream
DataInputStream DataOutputStream
FileInputStream FileOutputStream
FilterInputStream FilterOutputStream
ObjectInputStream ObjectOutputStream
PipedInputStream PipedOutputStream
PrintStream
PushbackInputStream
SequenceInputStream

Readers and Writers


The java.io.Reader and java.io.Writer classes are abstract superclasses for classes that read and write character-based data. The subclasses are notable for handling the conversion between different character sets. There are nine reader and eight writer classes in the core Java API, all in the java.io package:


BufferedReader
BufferedWriter
CharArrayReader
CharArrayWriter
FileReader
FileWriter
FilterReader
FilterWriter
InputStreamReader
LineNumberReader
OutputStreamWriter
PipedReader
PipedWriter
PrintWriter
PushbackReader
StringReader
StringWriter
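As one possible illustration of the character set conversion these classes handle, the sketch below pairs an InputStreamReader with an OutputStreamWriter to re-encode text from one charset to another. TranscodeDemo and transcode are hypothetical names, and byte arrays stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {

    // Decodes the input with one charset and re-encodes it with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } // closing the writer flushes any pending bytes into 'out'
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = {(byte) 0xE9}; // "é" as a single ISO-8859-1 byte
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2
    }
}
```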




For the most part, these classes have methods that are extremely similar to the equivalent stream classes. Often the only difference is that a byte in the signature of a stream method is replaced by a char in the signature of the matching reader or writer method.
For example, the java.io.OutputStream class declares these three write() methods:

public abstract void write(int i) throws IOException
public void write(byte[] data) throws IOException
public void write(byte[] data, int offset, int length) throws IOException

The java.io.Writer class declares three matching write() methods:

public void write(int i) throws IOException
public void write(char[] data) throws IOException
public abstract void write(char[] data, int offset, int length) throws IOException

As you can see, the signatures match except that in the latter two methods the byte array has become a char array, and that the abstract method is the single-value write() in OutputStream but the three-argument write() in Writer. There's also a less obvious difference not reflected in the signature. While the int passed to the OutputStream write() method is reduced modulo 256 before being output (only its low-order 8 bits are written), the int passed to the Writer write() method is reduced modulo 65,536 (only its low-order 16 bits are written). This reflects the different ranges of chars and bytes.
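The modulo behaviour can be checked with a quick sketch using the in-memory classes ByteArrayOutputStream and StringWriter. WriteIntDemo and its two helpers are names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;

class WriteIntDemo {

    // Writes one int to a byte stream and returns the byte that came out (0-255).
    static int writtenByte(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value);
        return out.toByteArray()[0] & 0xFF;
    }

    // Writes one int to a character stream and returns the char that came out.
    static char writtenChar(int value) {
        StringWriter writer = new StringWriter();
        writer.write(value);
        return writer.toString().charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(writtenByte(321));   // prints 65, because 321 % 256 == 65
        System.out.println(writtenChar(65601)); // prints A, because 65601 % 65536 == 65
    }
}
```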

java.io.Writer also has two more write() methods that take their data from a string:
public void write(String s) throws IOException
public void write(String s, int offset, int length) throws IOException

Because streams don't know how to deal with character-based data, there are no corresponding methods in the java.io.OutputStream class.
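A brief sketch of the two string-based write() methods, using a StringWriter so the result is easy to inspect. For contrast, it also shows that an OutputStream needs the string encoded to bytes first. WriteStringDemo and greeting are illustrative names only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

class WriteStringDemo {

    // Uses write(String) and write(String, offset, length) on a Writer.
    static String greeting() {
        StringWriter writer = new StringWriter();
        writer.write("Hello, ");
        writer.write("big wide world", 9, 5); // the 5 chars starting at index 9: "world"
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(greeting()); // prints Hello, world

        // An OutputStream has no write(String); the caller must pick an encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write("Hello, world".getBytes(StandardCharsets.UTF_8));
        System.out.println(out.size()); // prints 12
    }
}
```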
