我有一个非常大的数据集,看起来像这样
Column A
Date
2016-02-29 15:59:59.674
2016-02-29 15:59:59.695
2016-02-29 15:59:59.716
2016-02-29 15:59:59.752
2016-02-29 15:59:59.804
2016-02-29 15:59:59.869
2016-02-29 15:59:59.888
2016-02-29 15:59:59.941
2016-02-29 16:00:00.081 <-- get closest date since .081 < .941
2016-02-29 16:00:00.168
2016-02-29 16:00:00.189
2016-02-29 16:00:00.198
2016-02-29 16:00:00.247
2016-02-29 16:00:00.311
2016-02-29 16:00:00.345
2016-02-29 16:00:00.357
and for the other half an hour
2016-02-29 16:29:58.628
2016-02-29 16:29:58.639
2016-02-29 16:29:58.689
2016-02-29 16:29:58.706
2016-02-29 16:29:58.761
2016-02-29 16:29:58.865
2016-02-29 16:29:59.142
2016-02-29 16:29:59.542
2016-02-29 16:29:59.578
2016-02-29 16:30:00.171 <-- Get this date since .171 < .578
2016-02-29 16:30:00.209
2016-02-29 16:30:00.217
2016-02-29 16:30:00.245
2016-02-29 16:30:00.254
2016-02-29 16:30:00.347
2016-02-29 16:30:00.422
2016-02-29 16:30:00.457
2016-02-29 16:30:00.491
2016-02-29 16:30:00.555
2016-02-29 16:30:00.557
2016-02-29 16:30:00.645
现在数据集中的总行数约为5468389,对于excel来说,将所有内容都导入到一列中非常大,因此我尝试分批处理数据。
还有其他方法吗?我可以通过它来处理所有数据?我试图直接读取和写入文本,但是每当我尝试将其作为日期读取时,Type Mismatch
由于格式问题都会给我一个错误。出于同样的原因,我没有使用python来解决这个问题,也因为我也不精通python,所以我想到了在Excel VBA中进行此操作。
我也不太确定这个逻辑,所以我需要一些帮助。
Option Explicit
Sub Get_Closest_Dates()
Application.ScreenUpdating = False
Dim WI As Worksheet, WO As Worksheet
Dim i As Long, ct As Long
Dim num1 As Integer, num2 As Integer, num3 As Integer
Dim df1, df2
Set WI = Sheet1 'INPUT SHEET
Set WO = Sheet2 'OUTPUT SHEET
WI.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WO.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WI.Range("B1") = "HOUR"
WI.Range("C1") = "MINUTE"
With WI
.Range("B2").Formula = "=HOUR(A2)"
.Range("B2:B" & Rows.Count).FillDown
.Range("C2").Formula = "=MINUTE(A2)"
.Range("C2:C" & Rows.Count).FillDown
ct = WO.Range("A" & Rows.Count).End(xlUp).Row + 1
For i = 2 To 10000
num1 = .Range("C" & i).Value 'get Minutes
num2 = .Range("C" & i + 1).Value
If (num1 = 29 And num2 = 30) Then
df1 = 0.5 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 0.5
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
If (num1 = 59 And num2 = 0) Then
df1 = 1 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 1
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
Next i
End With
Application.ScreenUpdating = True
MsgBox "Process Completed"
End Sub
另外我不确定如何从日期中获取毫秒部分,以避免计算两个日期之间的差
就像15:59:59.674我674
该如何从时间中获得收益?
似乎您的第一个问题是将数据导入Excel。了解Excel可能不是处理大量数据的最佳程序(Access等DB程序可能更好),因此您需要将数据拆分到多个列或工作表中;或取样数据。
您选择了采样,因此在您读取数据时,我将进行采样和测试。
在处理包括毫秒在内的日期/时间戳时,您还必须处理Excel / VBA限制。
但是出于测试数据的目的,无需担心毫秒。只要您的数据是按升序排列的,则第一行的日期/时间戳等于或大于您的30分钟的增量将是最早的一行。
下面的代码应仅读取符合该条件的巨大文件中的行。请阅读注释以获取更多信息。
将这些行收集到一个集合中;然后声明一个结果数组,将其填充,并将结果写入工作表。
如果每行包含多个字段,而不仅仅是显示的单个行,那么在编写结果时,您将声明结果数组以容纳所有列,并在那时将其填充。
使用Collection / Array /写入工作表序列比处理每一行时将每一行一次写入工作表要快得多。
有一些方法可以加快代码的速度,也有解决可能的“内存不足”错误的方法,但这取决于您的实际数据以及此简单代码的处理方式。
到目前为止,我们需要将Excel解释为字符串的日期/时间戳转换为“实际”日期/时间,这取决于您要对后续数据进行的处理。
==========================================
Option Explicit
'Set Reference to Microsoft Scripting Runtime
Sub GetBigData()
Dim FSO As FileSystemObject
Dim TS As TextStream
Dim vFileName As Variant
Dim sLine As String
Dim dtLineTime As Date
Dim dtNextTime As Date
Dim colLines As Collection
vFileName = Application.GetOpenFilename("Text Files(*.txt), *.txt")
If vFileName = False Then Exit Sub
Set FSO = New FileSystemObject
Set TS = FSO.OpenTextFile(vFileName, ForReading, False, TristateFalse)
Set colLines = New Collection
With TS
'Assumes date/time stamps are contiguous
'skip any header lines
Do
sLine = .ReadLine
Loop Until InStr(sLine, ".") > 0
'Compute first "NextTime"
' note that it might be the first entry
' comment line 3 below if want first entry
' but would need to add logic if using other time increments
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
dtNextTime = Int(dtLineTime) + TimeSerial(Hour(dtLineTime), Int(Minute(dtLineTime) / 30) * 30, 0)
If Not (Minute(dtLineTime) = 30 Or Minute(dtLineTime) = 60) Then dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
Do
'Due to IEEE rounding problems, need to test equality as a very small value
'Could use a value less than 1 second = 1/86400 or smaller
If Abs(dtLineTime - dtNextTime) < 0.00000001 Or _
dtLineTime > dtNextTime Then
colLines.Add sLine
dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
End If
If Not .AtEndOfStream Then
sLine = .ReadLine
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
End If
Loop Until .AtEndOfStream
.Close
End With
'Write the collection to the worksheet
Dim V As Variant
Dim wsResults As Worksheet, rResults As Range
Dim I As Long
Set wsResults = Worksheets("sheet1")
Set rResults = wsResults.Cells(1, 1)
ReDim V(1 To colLines.Count, 1 To 1)
Set rResults = rResults.Resize(UBound(V, 1), UBound(V, 2))
For I = 1 To UBound(V, 1)
V(I, 1) = CStr(colLines(I))
Next I
With rResults
.EntireColumn.Clear
.NumberFormat = "@"
.Value = V
.EntireColumn.AutoFit
End With
End Sub
==========================================
EDIT添加了时间戳转换功能。这可以在将数据从集合对象复制到变量数组的时候实现。例如:
V(I, 1) = ConvertTimeStamp(colLines(I))
由于接收到的值是Double数据类型,因此您还需要适当地设置工作表上的该列的格式,而不是将其设置为Text:
.NumberFormat = "yyyy-mm-dd hh:mm:ss.000"
由于VBA日期类型的数据不支持毫秒,因此我们必须以Double形式返回该值。
=============================
Private Function ConvertTimeStamp(sTmStmp As String) As Double
Dim dtPart As Date
Dim dMS As Double 'milliseconds
Dim V As Variant
'Convert the date and time
V = Split(sTmStmp, ".")
dtPart = CDate(V(0))
dMS = V(1)
ConvertTimeStamp = dtPart + dMS / 86400 / 1000
End Function
=============================
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句