Reading a really big file with Java

Posted by PeakGen in Dev

PeakGen

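I am reading a 77 MB file inside a Servlet; in the future it will be 150 GB. The file was not written with any kind of nio package, just with a BufferedWriter.

Now this is what I need to do.

Read the file line by line. Each line is the "hash code" of a text. Split it into chunks of 3 characters (3 characters represent 1 word). A line may be long or short; I don't know in advance.
After reading the line, convert it into real words. We have a table of words and hashes, so we can look the words up.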
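
The two steps above can be sketched as follows. This is only a minimal illustration: the class name `HashLineDecoder`, the method `decode`, and the sample codes are hypothetical, and it assumes the word table fits in memory as a `Map<String, String>` keyed by the 3-character code. Trailing characters that do not fill a whole code are ignored here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashLineDecoder {
    // Assumption from the description: every word is encoded as exactly 3 characters
    private static final int CODE_LENGTH = 3;

    /** Splits a line into 3-character codes and looks each one up in the table. */
    public static List<String> decode(String line, Map<String, String> codeToWord) {
        List<String> words = new ArrayList<>(line.length() / CODE_LENGTH + 1);
        for (int start = 0; start + CODE_LENGTH <= line.length(); start += CODE_LENGTH) {
            String code = line.substring(start, start + CODE_LENGTH);
            // Unknown codes are kept as-is in this sketch
            words.add(codeToWord.getOrDefault(code, code));
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("a1b", "hello"); // made-up sample entries
        table.put("c2d", "world");
        System.out.println(decode("a1bc2d", table)); // prints [hello, world]
    }
}
```

A `HashMap` lookup per code is constant time on average, so with a table like this the decoding itself is unlikely to be the slow part.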

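Until now I have been using a BufferedReader to read the file. It is slow and not suitable for huge files like 150 GB. Even for this 77 MB file, the whole process takes hours. We cannot keep users waiting for hours; it should take seconds. So we decided to load the file into memory. First we thought about loading every line into a LinkedList so that memory could hold it. But as you know, memory cannot hold that much. After a lot of searching, I decided that mapping the file into memory would be the answer. Memory is faster than disk, so we should be able to read the file super fast too.

Code: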

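import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MapRead {

    public MapRead()
    {
        File file = new File("E:/Amazon HashFile/Hash.txt");
        // try-with-resources closes the file and its channel when done
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel c = raf.getChannel()) {

            MappedByteBuffer buffer = c.map(FileChannel.MapMode.READ_ONLY, 0, c.size()).load();

            for(int i=0;i<buffer.limit();i++)
            {
                System.out.println((char)buffer.get());
            }

            System.out.println(buffer.isLoaded());
            System.out.println(buffer.capacity());

        } catch (IOException ex) {
            Logger.getLogger(MapRead.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}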

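But I can't see anything "super fast" here, and I need the file line by line. I have a few questions.

You have read my description and know what I need to do. I have done the first step; is it correct?
Is the way I map the file correct? I mean, I see no difference from reading it the normal way. So does this keep the "whole" file in memory first (using the technique called mapping), and then we have to write other code to access that memory?
How do I read line by line in a super "fast" way? (If I have to load/map the whole file into memory for hours first and can then access it super fast in seconds, I am totally fine with that as well.)
Is it good to read files in a Servlet? (It is accessed by a large number of people, and only one IO stream will be open at a time. In this case, the servlet will be accessed by thousands at once.)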
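
One constraint relevant to the mapping question: `FileChannel.map` rejects a size larger than `Integer.MAX_VALUE` bytes, so a 150 GB file cannot go into a single `MappedByteBuffer`. The sketch below maps a file one window at a time instead. The class name `WindowedMapRead`, the method `countNewlines`, and the tiny window size in `main` are hypothetical and chosen only for illustration; it counts newline bytes as a stand-in for real processing.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WindowedMapRead {
    /** Counts '\n' bytes by mapping the file one window at a time. */
    public static long countNewlines(Path path, long windowSize) throws IOException {
        long newlines = 0;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                // The last window may be shorter than windowSize
                long length = Math.min(windowSize, size - pos);
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        newlines++;
                    }
                }
            }
        }
        return newlines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("hash", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "abc\ndef\nghi\n".getBytes(StandardCharsets.US_ASCII));
        // A 4-byte window is unrealistically small; it just exercises the loop
        System.out.println(countNewlines(tmp, 4)); // prints 3
    }
}
```

Note that a line can straddle two windows, so real line-by-line decoding on top of this needs to carry the partial line over to the next window. Mapping also does not by itself make per-character `System.out.println` calls fast; that output is likely the dominant cost in the code above.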

Update

This is how my code looks after I updated it with the answer from SO user Luiggi Mendoza.

public class BigFileProcessor implements Runnable {
    private final BlockingQueue<String> linesToProcess;
    public BigFileProcessor (BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }
    @Override
    public void run() {
        String line = "";
        try {
            while ( (line = linesToProcess.take()) != null) {

                System.out.println(line); //This is not happening
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}


public class BigFileReader implements Runnable {
    private final String fileName;
    int a = 0;

    private final BlockingQueue<String> linesRead;
    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }
    @Override
    public void run() {
        try {

            //Scanner do not work. I had to use BufferedReader
            BufferedReader br = new BufferedReader(new FileReader(new File("E:/Amazon HashFile/Hash.txt")));
            String str = "";

            while((str=br.readLine())!=null)
            {
               // System.out.println(a);
                a++;
            }

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}



public class BigFileWholeProcessor {
    private static final int NUMBER_OF_THREADS = 2;
    public void processFile(String fileName) {

        BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        es.shutdown();
    }
}



public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // TODO code application logic here
        BigFileWholeProcessor  b = new BigFileWholeProcessor ();
        b.processFile("E:/Amazon HashFile/Hash.txt");
    }
}

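I am trying to print the file in BigFileProcessor. What I understand is this:

The user enters the file name.
The file gets read by BigFileReader, line by line.
After every line, BigFileProcessor gets called. That is, suppose BigFileReader reads the first line. Now BigFileProcessor is called. Once BigFileProcessor finishes processing that line, BigFileReader reads line 2. Then BigFileProcessor gets called again for that line, and so on.

Maybe my understanding of this code is wrong. How should I process the lines anyway?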

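Luiggi Mendoza

I would recommend using multithreading here:

One thread takes care of reading every line of the file and inserting it into a BlockingQueue to be processed.
Another thread takes the elements from this queue and processes them.

To implement this multithreaded work, it is better to use the ExecutorService interface and pass it Runnable instances, each one implementing one of the tasks. Remember to have only a single task reading the file.

You could also manage a way to stop reading when the queue reaches a specific size, e.g. if the queue has 10000 elements, wait until its size goes down to 8000, then continue reading and filling the queue.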
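
A simpler variant of this throttling, sketched below, is to give the queue a fixed capacity so that `put` blocks the reader whenever the queue is full. This is not the 10000/8000 scheme described above: the reader resumes as soon as one slot is free rather than waiting for the queue to drain to a lower mark. The class name `BoundedQueueDemo` and the method `produceAndConsume` are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueDemo {
    /** Pushes `items` lines through a queue that never holds more than `capacity`. */
    public static int produceAndConsume(int capacity, int items) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.put("line " + i); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int consumed = 0;
        while (consumed < items) {
            queue.take(); // blocks while the queue is empty
            consumed++;
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(produceAndConsume(10, 1000)); // prints 1000
    }
}
```

With a bounded queue the memory used for pending lines stays capped no matter how large the file is, which matters far more at 150 GB than at 77 MB.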

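Is it good to read files in a Servlet?

I would recommend never doing heavy work in a servlet. Instead, fire an asynchronous task, e.g. via a JMS call, and then process the file in that external agent.

A brief sample of the above explanation to solve the problem: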

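import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.concurrent.BlockingQueue;

public class BigFileReader implements Runnable {
    private final String fileName;
    private final BlockingQueue<String> linesRead;
    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }
    @Override
    public void run() {
        //since it is a sample, I avoid the manage of how many lines you have read
        //and that stuff, but it should not be complicated to accomplish
        // try-with-resources closes the Scanner even if reading fails
        try (Scanner scanner = new Scanner(new File(fileName))) {
            // hasNextLine() pairs with nextLine(); hasNext() checks for tokens instead
            while (scanner.hasNextLine()) {
                linesRead.put(scanner.nextLine());
            }
        } catch (FileNotFoundException fnfe) {
            //handle the exception...
            fnfe.printStackTrace();
        } catch (InterruptedException ie) {
            //stop reading and restore the interrupt flag for the executor
            Thread.currentThread().interrupt();
        }
    }
}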

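import java.util.concurrent.BlockingQueue;

public class BigFileProcessor implements Runnable {
    private final BlockingQueue<String> linesToProcess;
    public BigFileProcessor (BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }
    @Override
    public void run() {
        String line = "";
        try {
            // Note: take() blocks until an element is available and never returns null,
            // so this loop only ends when the thread is interrupted
            while ( (line = linesToProcess.take()) != null) {
                //do what you want/need to process this line...
            }
        } catch (InterruptedException e) {
            // restore the interrupt flag and let the task end
            Thread.currentThread().interrupt();
        }
    }
}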

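import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class BigFileWholeProcessor {
    private static final int NUMBER_OF_THREADS = 2;
    public void processFile(String fileName) {
        // shared queue: the reader puts lines in, the processor takes them out
        BlockingQueue<String> fileContent = new LinkedBlockingQueue<>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        // shutdown() stops accepting new tasks; the two submitted tasks keep running
        es.shutdown();
    }
}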
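
Because `take()` blocks rather than returning null, the processor in the sample above has no way to learn that the file is finished. One common way to handle this is a "poison pill": the reader puts a sentinel object on the queue after the last line, and the processor stops when it sees it. The sketch below is a variation on the sample, not part of the original answer. The names `PoisonPillPipeline`, `countLines`, and `END_OF_FILE` are hypothetical, counting lines stands in for real processing, and the file is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PoisonPillPipeline {
    // Sentinel compared by identity (!=), so a real line with the same text is not mistaken for it
    private static final String END_OF_FILE = new String("<EOF>");

    public static long countLines(Path file) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        AtomicLong processed = new AtomicLong();
        ExecutorService es = Executors.newFixedThreadPool(2);

        // Reader task: one line per queue element, then the sentinel
        es.execute(() -> {
            try {
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        queue.put(line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                queue.put(END_OF_FILE); // tell the processor no more lines are coming
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Processor task: runs until it takes the sentinel off the queue
        es.execute(() -> {
            try {
                String line;
                while ((line = queue.take()) != END_OF_FILE) {
                    processed.incrementAndGet(); // stand-in for real per-line work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        es.shutdown();
        // One minute is an arbitrary limit for this demo; a huge file needs a longer wait
        es.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("hash", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, Arrays.asList("aaa", "bbb", "ccc"));
        System.out.println(countLines(tmp)); // prints 3
    }
}
```

With more than one processor thread, the reader would need to put one sentinel per processor, since each sentinel is consumed by exactly one `take()`.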


Edited on 2021-02-07
