Answer a question

I would like to split a large text file (around 50 GB) into multiple files. The data in the file looks like this [x = any integer between 0-9]:

xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............

There might be a few billion lines in the file, and I would like to write, for example, 30-40 million lines per file. I guess the steps would be:

  • I have to open the file
  • then, using readline(), read it line by line and write to a new file at the same time
  • and as soon as it hits the maximum number of lines, create another file and start writing again.

I'm wondering how to put all these steps together in a memory-efficient and fast way. I've seen some examples on Stack Overflow, but none of them does exactly what I need. I would really appreciate it if anyone could help me out.
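The steps above could be sketched roughly like this in Python. This is a minimal sketch, not the accepted answer's method; the output names part_00.txt, part_01.txt, ... are my own assumption. Iterating over the file object directly keeps memory use low, since only one line is buffered at a time (unlike readlines(), which would load everything).

```python
LINES_PER_FILE = 30_000_000  # assumed chunk size; adjust as needed

def split_file(path, lines_per_file=LINES_PER_FILE):
    """Split `path` into consecutive parts of `lines_per_file` lines each."""
    part = 0
    out = None
    with open(path, "r") as src:
        for i, line in enumerate(src):       # streams line by line
            if i % lines_per_file == 0:      # time to start a new part
                if out:
                    out.close()
                out = open(f"part_{part:02d}.txt", "w")  # hypothetical name
                part += 1
            out.write(line)
    if out:
        out.close()
```

Note that a pure-Python loop like this will be noticeably slower than the split command shown in the answer below, since each line passes through the interpreter.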

Answers

This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.

First, I created a test file with 1000M lines (15 GB) with:

awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt

Then I used split:

split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t

It took 5 minutes to produce a set of 34 small files named t00-t33. The first 33 files are 458 MB each, and the last one, t33, is 153 MB.
