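I have 3 million rows of data, each with 30 features, which makes it hard to hold everything in memory and hard to process with learning algorithms. I wanted to write some random sampling code, but in Java, on my PC's configuration, it either does not work or takes a very long time to run. I know that writing it in C or C++ could give a better solution, but I am also curious about how usable Python is in this situation. Is it reasonable to use Python where Java cannot run efficiently because of slowness and memory limits? Please don't just say "increase the heap size" or something similar.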
Answer
If performance is critical, this is the kind of solution I use.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A column-oriented table: each column is a memory-mapped file of floats,
// so the data lives outside the Java heap.
public class SimpleTable {
    private final List<RandomAccessFile> files = new ArrayList<>();
    private final List<FloatBuffer> buffers = new ArrayList<>();
    private final File baseDir;
    private final int rows;

    private SimpleTable(File baseDir, int rows) {
        this.baseDir = baseDir;
        this.rows = rows;
    }

    // Creates a new table directory and records the row count in a "rows" file.
    public static SimpleTable create(String baseName, int rows) throws IOException {
        File baseDir = new File(baseName);
        if (!baseDir.mkdirs()) throw new IOException("Failed to create " + baseName);
        PrintWriter pw = new PrintWriter(baseName + "/rows");
        pw.println(rows);
        pw.close();
        return new SimpleTable(baseDir, rows);
    }

    // Reopens an existing table, mapping every *.float file as a column.
    public static SimpleTable load(String baseName) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(baseName + "/rows"));
        int rows = Integer.parseInt(br.readLine());
        br.close();
        File baseDir = new File(baseName);
        SimpleTable table = new SimpleTable(baseDir, rows);
        File[] files = baseDir.listFiles();
        Arrays.sort(files); // columns are named 0000.float, 0001.float, ...
        for (File file : files) {
            if (!file.getName().endsWith(".float")) continue;
            table.addColumnForFile(file);
        }
        return table;
    }

    private FloatBuffer addColumnForFile(File file) throws IOException {
        RandomAccessFile rw = new RandomAccessFile(file, "rw");
        // A float is 4 bytes, so one column needs rows * 4 bytes
        // (the long literal avoids int overflow for large row counts).
        MappedByteBuffer mbb = rw.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, rows * 4L);
        mbb.order(ByteOrder.nativeOrder());
        FloatBuffer db = mbb.asFloatBuffer();
        files.add(rw);
        buffers.add(db);
        return db;
    }

    public int rows() {
        return rows;
    }

    public int columns() {
        return buffers.size();
    }

    public FloatBuffer addColumn() throws IOException {
        return addColumnForFile(new File(baseDir, String.format("%04d.float", buffers.size())));
    }

    public FloatBuffer getColumn(int n) {
        return buffers.get(n);
    }

    public void close() throws IOException {
        for (RandomAccessFile file : files) {
            file.close();
        }
        files.clear();
        buffers.clear();
    }
}
// SimpleTableTestMain.java (a separate file from SimpleTable.java)
import java.io.IOException;
import java.nio.FloatBuffer;

public class SimpleTableTestMain {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        // Write 50 columns of 3 million rows each.
        SimpleTable st = SimpleTable.create("test", 3 * 1000 * 1000);
        for (int i = 0; i < 50; i++) {
            FloatBuffer db = st.addColumn();
            for (int j = 0; j < db.capacity(); j++)
                db.put(j, i + j);
        }
        st.close();
        long mid = System.nanoTime();

        // Reload the table and sum every column to force a full read.
        SimpleTable st2 = SimpleTable.load("test");
        for (int i = 0; i < 50; i++) {
            FloatBuffer db = st2.getColumn(i);
            double sum = 0;
            for (int j = 0; j < db.capacity(); j++)
                sum += db.get(j);
            assert sum > 0;
        }
        long end = System.nanoTime();
        System.out.printf("Took %.3f seconds to write and %.3f seconds to read %,d rows and %,d columns%n",
                (mid - start) / 1e9, (end - mid) / 1e9, st2.rows(), st2.columns());
        st2.close();
    }
}
prints
Took 2.070 seconds to write and 2.206 seconds to read 3,000,000 rows and 50 columns
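Since the question is about random sampling, here is a rough sketch of how rows might be sampled from a memory-mapped float column like the ones above. It is not part of the original answer: the class name `ColumnSampleSketch`, the helpers `mapColumn` and `sampleRows`, the temp file, and the small row count are all assumptions made for illustration. The sampling uses a partial Fisher-Yates shuffle, which is one of several ways to pick distinct indices.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.Random;

public class ColumnSampleSketch {
    /** Maps a file holding `rows` floats into memory (hypothetical helper). */
    public static FloatBuffer mapColumn(File file, int rows) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            // 4 bytes per float; the mapping stays valid after the channel is closed.
            MappedByteBuffer mbb = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, rows * 4L);
            mbb.order(ByteOrder.nativeOrder());
            return mbb.asFloatBuffer();
        }
    }

    /** Picks k distinct row indices from [0, rows) with a partial Fisher-Yates shuffle. */
    public static int[] sampleRows(int rows, int k, Random rnd) {
        int[] idx = new int[rows];
        for (int i = 0; i < rows; i++) idx[i] = i;
        for (int i = 0; i < k; i++) {
            // Swap position i with a random position in [i, rows).
            int j = i + rnd.nextInt(rows - i);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return Arrays.copyOf(idx, k);
    }

    public static void main(String[] args) throws IOException {
        int rows = 1000; // small stand-in for the 3,000,000 rows in the question
        File file = File.createTempFile("column", ".float");
        file.deleteOnExit();

        FloatBuffer column = mapColumn(file, rows);
        for (int j = 0; j < rows; j++) column.put(j, j * 0.5f);

        // Read only the sampled rows; the rest of the column is never touched.
        int[] sample = sampleRows(rows, 10, new Random());
        for (int row : sample)
            System.out.printf("row %d -> %.1f%n", row, column.get(row));
    }
}
```

The index array costs 4 bytes per row (about 12 MB for 3 million rows), so the sample indices fit on the heap even when the feature data itself is kept in mapped files.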