How to read a large text file line by line without loading it into memory

cyworld 2022. 3. 20. 12:52

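I need to read a large file line by line. The file is larger than 5 GB and I need to read every line, but I obviously do not want to use readlines(), because it would build a very large list in memory.

How would the code below work in this case? Does xreadlines itself read lines into memory one at a time? Is the generator expression needed?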

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?

f.next()

Also, what can I do to read this in reverse order, like the Linux tail command?

I found:

http://code.google.com/p/pytailer/

and

"Head, tail and backward read by lines of a text file"

Both worked quite well!
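
For the reverse-reading part of the question, here is a minimal sketch of a tail-like helper that reads fixed-size blocks backwards from the end of the file, so only the final portion is ever held in memory. The function name tail and the block_size parameter are illustrative assumptions; this is not the pytailer implementation.

```python
import os

def tail(path, n=10, block_size=4096):
    """Return the last n lines of the file at `path` as a list of bytes objects."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read blocks backwards until more than n newlines have been seen
        # or the start of the file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    return data.splitlines()[-n:]
```

The file is opened in binary mode because seeking to arbitrary offsets is only well defined for bytes; decode the returned lines if text is needed.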

I provided this answer because Keith's, while succinct, doesn't close the file explicitly.

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is to use a context manager, which recent Python versions support.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.
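
As a small illustration of why this stays memory-friendly: iterating over the file object yields one line at a time, so a function like the following only ever holds the current line. The name count_matching_lines and its needle parameter are illustrative and do not come from the original answers.

```python
def count_matching_lines(path, needle, encoding="utf-8"):
    """Count lines containing `needle`, holding only one line in memory at a time."""
    count = 0
    with open(path, encoding=encoding) as infile:
        for line in infile:  # the file object yields lines lazily
            if needle in line:
                count += 1
    return count
```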

An old-school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()

You are better off using an iterator instead.
Related: fileinput, which iterates over lines from multiple input streams.

From the docs:

import fileinput
for line in fileinput.input("filename", encoding="utf-8"):
    process(line)

This avoids copying the whole file into memory at once.
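
Since fileinput is aimed at several input streams, a short sketch of iterating over more than one file may help. The helper name iter_lines is an assumption made for illustration; it uses fileinput.hook_encoded rather than the encoding argument, so it should also run on Python versions older than 3.10.

```python
import fileinput

def iter_lines(paths, encoding="utf-8"):
    """Yield (filename, line number within that file, line) across several files."""
    with fileinput.input(files=paths, openhook=fileinput.hook_encoded(encoding)) as stream:
        for line in stream:
            yield stream.filename(), stream.filelineno(), line
```

Lines are still produced one at a time, so memory use does not grow with the number or size of the files.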

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line, end='')  # each line already ends with its own newline

What to do if the file has no newlines:

with open('large_text.txt') as f:
  while True:
    c = f.read(1024)
    if not c:
      break
    print(c)
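
The same fixed-size reading loop can be wrapped in a generator so that callers can simply iterate over the pieces. This is a sketch; the name iter_chunks and the default sizes are assumptions made for illustration.

```python
from functools import partial

def iter_chunks(path, chunk_size=1024, encoding="utf-8"):
    """Yield successive pieces of at most `chunk_size` characters from a text file."""
    with open(path, encoding=encoding) as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns the empty string, which signals end of file.
        for chunk in iter(partial(f.read, chunk_size), ""):
            yield chunk
```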

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So I recreated the cp command using line-by-line reading and writing. It's crazy fast.

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

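The Blaze project has come a long way over the last six years. It has a simple API covering a useful subset of pandas features.

dask.dataframe handles chunking internally, supports many parallelizable operations, and lets you easily export slices back to pandas for in-memory operations.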

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()

Here is code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data, data is one single line of the file
        pass
    else:
        # end of file reached
        pass


data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

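process_lines is the callback function. It is called for every line, with the data parameter representing a single line of the file at a time.

You can configure the CHUNK_SIZE variable depending on your machine's hardware configuration.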
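
To make the callback contract concrete, here is a minimal stand-in that follows the same (data, eof, file_name) calling convention. It is not the code from the gist: the name read_lines_with_callback is hypothetical, and it ignores chunk sizes entirely by relying on the file iterator.

```python
def read_lines_with_callback(file_name, callback, encoding="utf-8"):
    """Call callback(data, eof, file_name) per line, then once more with eof=True."""
    with open(file_name, encoding=encoding) as f:
        for line in f:
            # Hypothetical convention: pass each line without its trailing newline.
            callback(line.rstrip("\n"), False, file_name)
    # Signal end of file with no data.
    callback(None, True, file_name)
```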

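How about this? Divide your file into chunks and then read it line by line, because when you read a file, your operating system caches the next line. If you read the file line by line, you are not making efficient use of the cached information.

Instead, divide the file into chunks, load the whole chunk into memory, and then do your processing.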

import os

def chunks(fh, size=1024):
    # Yield (start, length) pairs for chunks that end on a line boundary
    while True:
        startat = fh.tell()  # file object's current position from the start
        fh.seek(size, 1)  # skip `size` bytes; whence=1 means relative to the current position
        data = fh.readline()  # advance to the end of the current line
        yield startat, fh.tell() - startat  # only offsets are kept, not the whole file
        if not data:
            break

# fname is the path of the file to read, defined elsewhere
if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # e.g. permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other unexpected exceptions
        print("Unexpected error: {0}".format(e1))
    else:
        with fh:
            for start, length in chunks(fh):
                fh.seek(start)  # startat
                data = fh.read(length)  # read up to the end of the chunk
                print(data)

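Thank you! I recently converted to Python 3 and was frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to take a couple of extra steps. Each line was preceded by a "b", which I guess meant it was in binary format. Using "decode(utf-8)" changed it to ASCII.

Then I had to remove the "=\n" in the middle of each line.

Then I split the lines at the newline.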

# Fragment: replaces "print data" inside the "for ele in chunks(fh)" loop.
# Assumes "import binascii" at the top of the file and "i = 0" before the loop.
b_data = fh.read(ele[1])  # endat: one chunk of ASCII data as bytes
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # quoted-printable text with "=\n" soft line breaks
data_chunk = a_data.replace('=\n', '').strip()  # soft line breaks removed
data_list = data_chunk.split('\n')  # list containing the lines in the chunk
for line_of_data in data_list:  # iterate through data_list to get each line
    i += 1  # running line counter
    print(line_of_data)

Here is the code, starting just above "print data" in Arohi's code.

This is the best solution I found for this, and I tried it on a 330 MB file.

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

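Here line_length is the number of characters in a single line. For example, "abcd" has a line length of 4. This approach assumes every line in the file has the same length.

I added 2 to the line length to skip the '\n' character and move to the next character. Note that an offset of 2 per line corresponds to a two-byte line ending such as '\r\n'; for files whose lines end in a single '\n', the offset would be 1.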
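
When lines do not all have the same length, the offset cannot be computed by multiplication. One possible alternative, sketched here under that assumption, is to scan the file once and record where each line starts. The names build_line_index and read_line are illustrative and not from the answer above.

```python
def build_line_index(path):
    """Return a list with the byte offset at which each line starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:  # one pass, one line in memory at a time
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, lineno):
    """Return line number `lineno` (0-based) as bytes, using a prebuilt index."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline()
```

The index costs one integer per line, which is usually far smaller than the file itself but is still worth keeping in mind for files with a very large number of lines.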

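I realise this was answered quite some time ago, but here is a way of doing this in parallel without killing your memory overhead (which is what would happen if you tried to fire each line into the pool). Obviously, swap the readJSON_line2 function out for something sensible; it is only here to illustrate the point!

Speedup will depend on the file size and on what you do with each line, but in the worst case, for a small file that is only read with the JSON reader, I'm seeing performance similar to the single-threaded (ST) version with the settings below.

Hopefully this is useful to someone out there: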

def readJSON_line2(linesIn):
   #Function for reading a chunk of json lines
   '''
   Note, this function is nonsensical. A user would never use the approach suggested 
   for reading in a JSON file, 
   its role is to evaluate the MT approach for full line by line processing to both 
   increase speed and reduce memory overhead
   '''
   import json

   linesRtn = []
   for lineIn in linesIn:

       if lineIn.strip():  #Non-blank line (the original "!= 0" test was always true)
           lineRtn = json.loads(lineIn)
       else:
           lineRtn = ""

       linesRtn.append(lineRtn)

   return linesRtn




# -------------------------------------------------------------------
if __name__ == "__main__":
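   import multiprocessing as mp

   path1 = "C:\\user\\Documents\\"
   file1 = "someBigJson.json"

   nCPUs = mp.cpu_count()  # Assumed here: one worker per CPU core (undefined in the original)
   pool = mp.Pool(nCPUs)  # Worker pool used by apply_async below (undefined in the original)

   nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)
   nChunk = 1000 # How many lines are in each chunk
   #Both of the above will require balancing speed against memory overhead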

   iJob = 0  #Tracker for SMP jobs submitted into pool
   iiJob = 0  #Tracker for SMP jobs extracted back out of pool

   jobs = []  #SMP job holder
   MTres3 = []  #Final result holder
   chunk = []  
   iBuffer = 0 # Buffer line count
   with open(path1+file1) as f:
      for line in f:
            
          #Send to the chunk
          if len(chunk) < nChunk:
              chunk.append(line)
          else:
              #Chunk full
              #Don't forget to add the current line to chunk
              chunk.append(line)
                
              #Then add the chunk to the buffer (submit to SMP pool)                  
              jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
              iJob +=1
              iBuffer +=1
              #Clear the chunk for the next batch of entries
              chunk = []
                            
          #Buffer is full, any more chunks submitted would cause undue memory overhead
          #(Partially) empty the buffer
          if iBuffer >= nBuffer:
              temp1 = jobs[iiJob].get()
              for rtnLine1 in temp1:
                  MTres3.append(rtnLine1)
              iBuffer -=1
              iiJob+=1
            
      #Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
      if chunk:
          jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
          iJob +=1
          iBuffer +=1

      #And gather up the last of the buffer, including the final chunk
      while iiJob < iJob:
          temp1 = jobs[iiJob].get()
          for rtnLine1 in temp1:
              MTres3.append(rtnLine1)
          iiJob+=1

   #Cleanup
   del chunk, jobs, temp1
   pool.close()
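
The chunk-building part of the loop above could also be expressed with itertools.islice, which pulls a fixed number of lines from the file iterator at a time. This is only a sketch of the batching step; the name iter_batches is an assumption, and the pool handling is left out.

```python
from itertools import islice

def iter_batches(lines, batch_size):
    """Yield lists of up to `batch_size` items from any iterable of lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be submitted with pool.apply_async(readJSON_line2, args=(batch,)), keeping the same buffer-draining logic to bound how many batches are in flight.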

This might be useful when you want to work in parallel and read only chunks of data, while keeping each chunk cleanly aligned on newlines.

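def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # Extend the chunk until it ends on a newline
        while data[-1:] != '\n':
            ch = fileObj.read(1)
            if not ch:  # end of file without a trailing newline
                break
            data += ch
        yield data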
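
A possible variant of the same idea completes the partial last line with readline() instead of reading one character at a time; readline() returns an empty string at end of file, so the loop cannot spin forever on a file that lacks a trailing newline. The name read_line_aligned_chunks is illustrative, and the sketch assumes a file opened in text mode.

```python
import io

def read_line_aligned_chunks(file_obj, chunk_size=1024):
    """Yield chunks of roughly `chunk_size` characters that end at a line break."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        if not data.endswith("\n"):
            # Finish the partial last line; readline() returns "" at end of file.
            data += file_obj.readline()
        yield data

# Small in-memory demonstration
sample = io.StringIO("alpha\nbeta\ngamma\ndelta")
chunks = list(read_line_aligned_chunks(sample, chunk_size=8))
```

Each chunk can then be split with splitlines() and handed to a worker without any line being cut in half.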

Reference URL: https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-line-by-line-without-loading-it-into-memory
