sciforums SciForums.com : Technology : Computer Science & Culture
Problem converting text file to binary file
Success_Machine
Impossible? I can do that (321 posts)
10-10-03, 02:45 AM
#1
I converted a text file to binary format using Visual Basic 6.0. The resulting binary file is bigger than the original text file.

I thought binary files were supposed to be much smaller? Am I making a mistake, or was I misled?

malkiri
Registered Senior User (198 posts)
10-10-03, 09:31 AM
#2
Can you elaborate on how exactly you converted it? Text files and binary files are both composed of binary values, but text files conform to a standard (ASCII) where particular patterns of bits represent each character. Technically, a text file is a binary file.
Success_Machine
Impossible? I can do that (321 posts)
10-10-03, 11:15 AM
#3
I had a text file composed of lots of strings and numbers, essentially a database file with thousands of records, each with a dozen fields. I used the following general code:

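Option Explicit

' One record: a string, a count, and a variable number of values.
Private Type cRecord
    field1 As String
    field2 As Double
    field3() As Variant
End Type

Private Sub Form_Activate()
    Dim oneRecord As cRecord
    Dim k As Long   ' loop counter; Option Explicit requires it to be declared

    ' Read one record from the text file.
    Open "filename.txt" For Input As #1
    Input #1, oneRecord.field1
    Input #1, oneRecord.field2
    ReDim oneRecord.field3(oneRecord.field2) As Variant   ' indices 0 To field2
    For k = 0 To oneRecord.field2
        Input #1, oneRecord.field3(k)
    Next k
    Close #1

    ' Write the whole record to the binary file with a single Put.
    Open "fileout.bin" For Binary Access Write As #2
    Put #2, , oneRecord
    Close #2
End Sub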


I think that's pretty much it. It created a binary file larger than the original text file, while all the literature, including the online MSDN library, tells me that binary files are supposed to conserve disk space and run faster!

So what gives?
malkiri
Registered Senior User (198 posts)
10-10-03, 11:33 AM
#4
I'm not too familiar with VB, so I can't explain any intricacies. Try this...make a cRecord in your code (instead of reading it from a file), write it to a text file and a binary file, and compare those file sizes. It may be that when they say binary files are smaller, what they mean is that the binary files written programmatically are smaller than text files also written programmatically. The MSDN article I found says that text files use fixed-length fields...this sounds like when it writes a string of 3 characters to a file, it might actually write a character of padding plus those 3 characters (assuming the fixed length was 4).
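For anyone who wants to try that comparison outside VB, here is a minimal Python sketch (not VB, and not how VB lays out its files). It builds one made-up record in code and measures it both as comma-separated text and as a packed binary layout. The function name `record_sizes`, the sample values, and the binary layout (2-byte lengths, 8-byte doubles) are all assumptions for illustration.

```python
import struct

def record_sizes(name, readings):
    """Return (text_bytes, binary_bytes) for one hypothetical record."""
    # Text form: comma-separated fields followed by CR LF.
    line = name + "," + ",".join(repr(x) for x in readings) + "\r\n"
    text_bytes = len(line.encode("ascii"))

    # Binary form (assumed layout): 2-byte name length, name bytes,
    # 2-byte count, then one 8-byte double per reading.
    packed = (
        struct.pack("<H", len(name)) + name.encode("ascii")
        + struct.pack("<H", len(readings))
        + struct.pack("<%dd" % len(readings), *readings)
    )
    return text_bytes, len(packed)

# Short numbers: the text form is the smaller one.
print(record_sizes("a", [5.0, -5.0]))
# Numbers with many digits: the binary form is the smaller one.
print(record_sizes("a", [123456.789012, 987654.321098]))
```

With short numbers the text form comes out smaller, and with many-digit numbers the binary form does, so which one is smaller depends on the data.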
Success_Machine
Impossible? I can do that (321 posts)
10-10-03, 01:07 PM
#5
You are referring to files opened for Random Access. Fields have to be fixed length so that you can jump to any point in the file and find the data you are looking for. If the field doesn't contain enough data to fill it, then it will be padded, just to ensure all fields are identical size. If the fields are not all the same size, then there is no way of knowing the address of a particular chunk of data. Binary files use any size fields, so for this reason they have to be read sequentially.

I still can't figure out why my binary file is larger than the original text file. The original file was 99 percent numbers, and in binary output numbers are supposed to get compressed. I should have seen a significant reduction in file size, not the opposite.
malkiri
Registered Senior User (198 posts)
10-10-03, 01:12 PM
#6
What article in MSDN were you referring to earlier?
Success_Machine
Impossible? I can do that (321 posts)
10-10-03, 01:30 PM
#7
Here is the link to the MSDN Library: "Using Binary File Access"

Quotes:

"...you can conserve disk space by building variable-length records. Use binary access when it is important to keep file size small...."

"...You can minimize the use of disk space by using binary access."
malkiri
Registered Senior User (198 posts)
10-10-03, 02:42 PM
#8
Okay, I'm confused. The article seems to say you can only use variable-length records when you're writing a binary file.

"To best appreciate binary access, consider a hypothetical Employee Records file. This file uses fixed-length records and fields to store information about employees."

"Regardless of the actual contents of the fields, every record in that file takes 209 bytes."

"You can minimize the use of disk space by using binary access. Because this doesn't require fixed-length fields, the type declaration can omit the string length parameters."

So it's saying that files written in this manner will be smaller than files written using fixed-length records, not an equivalent text file.
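To make the fixed-length versus variable-length point concrete, here is a small Python sketch (not VB). The field widths `NAME_WIDTH` and `NOTE_WIDTH` and both function names are hypothetical; they are not the layout from the MSDN article, just an illustration of padding versus storing only what is needed.

```python
import struct

# Hypothetical field widths, made up for this example.
NAME_WIDTH = 20
NOTE_WIDTH = 50

def fixed_record(name, note):
    """Pad (or truncate) every field to its declared width."""
    padded_name = name.ljust(NAME_WIDTH)[:NAME_WIDTH]
    padded_note = note.ljust(NOTE_WIDTH)[:NOTE_WIDTH]
    return (padded_name + padded_note).encode("ascii")

def variable_record(name, note):
    """Store each field as a 2-byte length followed by only its own bytes."""
    out = b""
    for field in (name, note):
        data = field.encode("ascii")
        out += struct.pack("<H", len(data)) + data
    return out

print(len(fixed_record("Ann", "ok")))     # always NAME_WIDTH + NOTE_WIDTH
print(len(variable_record("Ann", "ok")))  # grows only with the actual data
```

The fixed record costs the same no matter how little it holds, which is the waste the article is talking about. Neither form is being compared against a plain text file here.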
Success_Machine
Impossible? I can do that (321 posts)
10-10-03, 03:31 PM
#9
You are confusing output "mode" with data representation. When you open a file for output, you have to specify one of at least 3 output modes for that file:

1. Open file for output as #1 allows strings or numbers to be written to the file, which can subsequently be read by a text editor. Data from this file has to be read sequentially, since it is a hodgepodge of data in variable-sized chunks, using delimiters such as commas or tabs to separate chunks.

2. Open file for random access write as #1 allows strings or numbers to be written to the file in equal-sized chunks. If a chunk does not contain enough data, it will be padded to ensure all chunks are equal size. The data can subsequently be read non-sequentially; that is, you can access data in the middle of the file directly by knowing its address, without first accessing all the data before it.

3. Open file for binary access write as #1 allows strings or numbers to be written to the file in chunks of variable size, conforming to the size of the data in the chunk. Therefore no padding is used, but the file must be read sequentially: you must start reading at the top of the file to access data in the middle. Delimiters are not used to separate data chunks; whole records are written and retrieved at once. Text editors cannot display this file meaningfully, as all the data appears as gibberish.
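The "jump straight to a record by its address" idea in mode 2 can be sketched in Python (not VB) as follows. The record layout (a 10-byte padded name plus an 8-byte double) and the names `RECORD`, `build_file`, and `read_record` are assumptions for illustration; an in-memory stream stands in for a file on disk.

```python
import io
import struct

# Hypothetical fixed-size record: 10-byte padded name + 8-byte double.
RECORD = struct.Struct("<10sd")

def build_file(rows):
    """Write every row as one fixed-size record into an in-memory file."""
    stream = io.BytesIO()
    for name, value in rows:
        stream.write(RECORD.pack(name, value))  # short names are NUL-padded
    return stream

def read_record(stream, index):
    """Seek directly to record number `index` without reading earlier ones."""
    stream.seek(index * RECORD.size)
    name, value = RECORD.unpack(stream.read(RECORD.size))
    return name.rstrip(b"\x00"), value

f = build_file([(b"alpha", 1.5), (b"beta", 2.5), (b"gamma", 3.5)])
print(read_record(f, 2))
```

The offset is just `index * RECORD.size`, which only works because every record has the same size. With variable-size records there is no such formula unless a separate index of offsets is kept.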


An interesting example that seems to corroborate my problems with bloated binary files compared to text files is a simple program I just wrote. It writes an identical chunk of data to both a text file and a binary file, then compares the file sizes. Here are the results:

Data = Null
Text file size = 8 bytes
Binary file size = 10 bytes

Data = "."
Text file size = 5 bytes
Binary file size = 5 bytes

Data = "A"
Text file size = 5 bytes
Binary file size = 5 bytes

Data = "AA"
Text file size = 6 bytes
Binary file size = 6 bytes

Data = "5"
Text file size = 3 bytes
Binary file size = 6 bytes

Data = "50"
Text file size = 4 bytes
Binary file size = 6 bytes

Data = "550"
Text file size = 5 bytes
Binary file size = 6 bytes

Data = "5.5"
Text file size = 5 bytes
Binary file size = 10 bytes

Data = "-5"
Text file size = 4 bytes
Binary file size = 10 bytes

Data = "-5.5"
Text file size = 6 bytes
Binary file size = 10 bytes


I don't really see a predictable pattern, except that text files seem to be consistently smaller than binary files, in keeping with my earlier problem! So I don't know why all those people before me published info stating that binary files are smaller, or how they arrived at that conclusion?!?!??
malkiri
Registered Senior User (198 posts)
10-10-03, 04:01 PM
#10
I wasn't confusing output modes and representation - I was trying to point out that the article itself doesn't make any assertions that outputting in binary will result in a smaller file than a corresponding text file, which is what you were asking about. It only says it'll be smaller than what you get when you output the same data represented in a fixed-length record. It is true to say that data written in binary have the potential to be smaller than when written as text files, but it won't always be the case.

I can't be specific about what's going on in the binary files, since I can't see them and don't know how VB implements the output. I can point out a few things:
Data = "."
Text file size = 5 bytes
Binary file size = 5 bytes

Data = "A"
Text file size = 5 bytes
Binary file size = 5 bytes

Data = "AA"
Text file size = 6 bytes
Binary file size = 6 bytes
Any ASCII character is going to be a single byte, regardless of how you output it. It can't be any less because files are stored in whole 8-bit bytes: ASCII itself defines 128 characters (7 bits), and the extended 8-bit character sets define 256, so each character takes one byte. You can see this when you add the second "A"...the file size goes up one byte.
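As a rough check of the one-byte-per-character point, here is a Python sketch (not VB) that models the text file sizes. It assumes strings are written inside double quotes, numbers are written bare, and every value ends with CR LF. Those formatting details are a guess, picked because they reproduce the text sizes quoted in this thread; the function name `text_size` is made up.

```python
def text_size(value):
    """Bytes for one value written as a line of ASCII text (assumed format)."""
    if isinstance(value, str):
        body = '"' + value + '"'  # assumption: strings are written quoted
    else:
        body = str(value)         # assumption: numbers are written bare
    return len((body + "\r\n").encode("ascii"))

for v in [".", "A", "AA", 5, 50, 550, 5.5, -5, -5.5]:
    print(repr(v), text_size(v))
```

Under those assumptions, "A" costs 1 byte for the character plus 4 bytes of quotes and line ending, and each extra character or digit adds exactly one byte.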
Data = "5"
Text file size = 3 bytes
Binary file size = 6 bytes

Data = "50"
Text file size = 4 bytes
Binary file size = 6 bytes

Data = "550"
Text file size = 5 bytes
Binary file size = 6 bytes
If you notice, the binary file size remains at 6 bytes in all 3 of these. That's because these three integers can be represented in the same number of bytes (I'm guessing 2 bytes, but it could be 4...try writing a number above 2^16 if you want to find out).
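Here is a short Python sketch (not VB) of why 5, 50, and 550 all cost the same in a fixed-width binary integer while their text forms grow by one byte per digit. The helper name `fits_in_16_bits` is made up, and nothing here says which width VB actually picked.

```python
import struct

def fits_in_16_bits(n):
    """True if n can be packed as a signed 16-bit integer."""
    try:
        struct.pack("<h", n)
        return True
    except struct.error:
        return False

for n in (5, 50, 550, 70000):
    digits = len(str(n))                # bytes needed as text digits
    as_int32 = len(struct.pack("<i", n))  # always 4 bytes
    print(n, digits, fits_in_16_bits(n), as_int32)
```

A value that no longer fits in 16 bits forces a wider type, which is the kind of jump in file size the experiment above would reveal.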

My guess is some overhead is taking up the extra bytes in the binary files. I'd also predict you'll see the text files gaining size faster than binary files when you move to larger files.

As far as your original issue...perhaps you had a lot of negative integers in the text file that were read into the program and stored in 32 bit variables? The text file would have a "-1" (2 bytes) where the binary file would have "11111111 11111111 11111111 11111111" (4 bytes). There are other places where the representations differ like this, going both directions, such as floats & doubles. It's impossible to make a general case for which one will be smaller.
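The "-1" case, and the way it cuts both directions, can be checked with a small Python sketch (not VB). The function name `compare` is made up; the widths used are a 32-bit integer and a 64-bit double, which is an assumption about what the numbers were stored as.

```python
import struct

def compare(value, fmt):
    """Return (text_bytes, binary_bytes) for one number."""
    return len(repr(value).encode("ascii")), len(struct.pack(fmt, value))

print(compare(-1, "<i"))          # text is smaller
print(compare(1234567890, "<i"))  # binary is smaller
print(compare(5.5, "<d"))         # text is smaller
print(compare(1 / 3, "<d"))       # binary is smaller

# The four bytes of a 32-bit -1, shown bit by bit.
bits = " ".join(format(b, "08b") for b in struct.pack("<i", -1))
print(bits)
```

Short numbers favor text and long ones favor binary, so a file full of small values can easily grow when converted.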
All times are GMT -5.