Monday, May 26, 2014

Java: processing large volumes of data

Problem description: I recently worked on a data-integration project: fetch a CSV file from a remote FTP server, then load the CSV data into the database. Say I read a CSV file containing nearly 60,000 rows. During loading I have to check whether each record already exists in the database: if it exists and is unchanged, filter it out; if it exists and has changed, update it; if it does not exist, insert it.
Current approach: take the total row count n, process 5,000 rows at a time, compute the total number of iterations, and loop over the data in 5,000-row chunks, connecting to the database and executing each chunk as a batch. Total execution time is about 10 minutes. Is that slow? Is there a better solution?
------ Solution --------------------------------------------
First, insert statements can be executed in batch mode, many rows per round trip.
Second, don't open and close database connections frequently; use a mature JDBC connection pool.
Third, if memory allows, cache data you have already queried locally to reduce database I/O.
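A minimal sketch of the first two points, assuming a hypothetical table `t_record(code, val)`: one connection is reused for the whole file, and rows are flushed with JDBC batch execution every 5,000 rows. The `batchCount` helper just shows the chunk arithmetic from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    // Number of full or partial batches needed for n rows at the given batch size.
    static int batchCount(int n, int batchSize) {
        return (n + batchSize - 1) / batchSize;
    }

    // Insert all rows over ONE reused connection, flushing every batchSize rows.
    // Table and column names (t_record, code, val) are made up for illustration.
    static void insertAll(Connection conn, List<String[]> rows, int batchSize) throws Exception {
        String sql = "INSERT INTO t_record (code, val) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == batchSize) { // flush a full batch
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch(); // flush the tail
        }
    }

    public static void main(String[] args) {
        // 60,000 rows in chunks of 5,000, as in the question.
        System.out.println(batchCount(60000, 5000)); // prints 12
    }
}
```

Batches of a few thousand keep round trips low without building enormous statements; when many files are processed, pair this with a connection pool as the reply suggests.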
------ Solution --------------------------------------------

Parsing the file and updating the database are not slow; the existence comparison is a bit slow, and the connection is opened and closed frequently during processing.
If the comparison is slow, consider MD5: concatenate all field values into one string, e.g. field1_field2, take the MD5 of that string, and store it in a dedicated, indexed column; then you can check existence by looking up that column alone. Connections are the easy part: whether you use a plain connection or a data source, one connection per file batch is fine.
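The MD5 idea can be sketched with the standard-library MessageDigest. The underscore separator follows the reply, though note that a bare `_` is ambiguous when field values themselves contain underscores (length-prefixing or a rarer delimiter avoids that).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowFingerprint {
    // Join all field values with '_' and take the MD5, as the reply suggests.
    static String md5Of(String... fields) throws Exception {
        String joined = String.join("_", fields);
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(joined.getBytes(StandardCharsets.UTF_8));
        // Render the 16-byte digest as 32 lowercase hex characters.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Of("field1", "field2"));
    }
}
```

With the MD5 stored in its own indexed column, the existence check becomes a single indexed lookup, or a bulk `WHERE md5_col IN (...)` query per batch.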
------ Solution --------------------------------------------
Suppose the input data is being loaded into table B. Consider first inserting all the data into a staging table A, then joining A against B: rows not present in B are inserted, rows that exist but have changed are updated, and table A is cleared afterwards. This is fast: once the batch is in the database you only run set-based updates and inserts, with no frequent per-row comparison queries.
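To make the staging-table flow concrete, here is an in-memory sketch in which two maps stand in for staging table A and target table B; in real SQL this would be an `UPDATE` with a join plus an `INSERT ... SELECT ... WHERE NOT EXISTS`, then truncating A (exact syntax varies by database).

```java
import java.util.HashMap;
import java.util.Map;

public class StagingMerge {
    // 'staging' plays table A (the freshly loaded CSV batch),
    // 'target' plays table B (the destination table).
    static void merge(Map<String, String> staging, Map<String, String> target) {
        for (Map.Entry<String, String> e : staging.entrySet()) {
            String existing = target.get(e.getKey());
            if (existing == null || !existing.equals(e.getValue())) {
                target.put(e.getKey(), e.getValue()); // insert or update
            }                                          // unchanged rows are skipped
        }
        staging.clear(); // table A is emptied after each batch
    }

    public static void main(String[] args) {
        Map<String, String> target = new HashMap<>();
        target.put("k1", "old");
        target.put("k2", "same");

        Map<String, String> staging = new HashMap<>();
        staging.put("k1", "new");   // exists, changed   -> update
        staging.put("k2", "same");  // exists, unchanged -> skip
        staging.put("k3", "fresh"); // missing           -> insert

        merge(staging, target);
        System.out.println(target.get("k1") + " " + target.get("k3") + " " + staging.size());
        // prints: new fresh 0
    }
}
```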
------ Solution --------------------------------------------
Wouldn't it be better to load each file into a temporary table and then run a stored procedure, so the database performs these operations itself?
------ Solution --------------------------------------------
Store the CSV data into a new table and do the comparison in SQL; I don't know whether that would work for you.
------ Solution --------------------------------------------
Your data falls into three cases:
1. Not in the database: insert it.
2. In the database but changed: update it.
3. In the database and unchanged: skip it.

Consider initializing a local cache of the key information already in the database before each run, doing all the updates first and the inserts afterwards.
In other words, don't query the database on every pass through the loop; reduce the round trips to the database.
Also, do you have an estimate of the total data volume, and of what fraction of each batch already exists in the database?
------ Solution --------------------------------------------
A temporary table? Why isn't that allowed?
------ Solution --------------------------------------------

Parsing the file and updating the database are not slow; the existence comparison is a bit slow, and the connection is opened and closed frequently during processing.
If the comparison is slow, consider MD5: concatenate all field values into one string, e.g. field1_field2, take the MD5 of that string, and store it in a dedicated, indexed column; then you can check existence by looking up that column alone. Connections are the easy part: whether you use a plain connection or a data source, one connection per file batch is fine.

That's a good idea, but caching the combined MD5 values locally might be even more efficient.
------ Solution --------------------------------------------
The OP could compute a hash of each record (it need not be MD5) and, before inserting a row, check whether that record's hash already exists; discard the row if it does, insert it otherwise.

Use a hash map to record the hashes.
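A sketch of that idea: keep one in-memory set of record hashes and consult it instead of the database before each insert. Java's 32-bit `hashCode` can collide, so for real data storing the full MD5 string from the earlier reply is safer; the structure is the same either way.

```java
import java.util.HashSet;
import java.util.Set;

public class HashDedup {
    // In-memory record of hashes already seen; checked instead of the database.
    private final Set<Integer> seen = new HashSet<>();

    // Returns true if the row is new and should be inserted.
    // Set.add returns false when the hash was already present,
    // so duplicates are discarded without a database query.
    boolean shouldInsert(String... fields) {
        int hash = String.join("_", fields).hashCode(); // any stable hash works
        return seen.add(hash);
    }

    public static void main(String[] args) {
        HashDedup dedup = new HashDedup();
        System.out.println(dedup.shouldInsert("a", "1")); // first time -> true
        System.out.println(dedup.shouldInsert("a", "1")); // duplicate  -> false
        System.out.println(dedup.shouldInsert("b", "2")); // new record -> true
    }
}
```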
------ Solution --------------------------------------------
Decide in a stored procedure: let the stored procedure check whether the data exists. That should be fast, shouldn't it?
------ Solution --------------------------------------------
My personal recommendation: first load the data into a temporary table, disconnect, and run a dedicated data-processing stored procedure. That will certainly speed things up.
------ Solution --------------------------------------------
A HashMap may well exhaust the machine's memory; putting the data in a cache would still be better.
------ Solution --------------------------------------------

------ For reference only ---------------------------------------
Take it one step at a time: you have to figure out where the biggest bottleneck is. Is the existence comparison slow, is parsing the file slow, or is updating the data slow?
------ For reference only ---------------------------------------
One more thing: 5,000 rows per batch may be too few; consider executing 50,000 at a time.
------ For reference only ---------------------------------------
I feel this problem shouldn't be solved with a loop.

If reads and writes like this are very frequent, it may not be something a relational database can solve.

I'm a newcomer, so please go easy on me, experts.
------ For reference only ---------------------------------------

The insert statements are already executed in batches. On the second point: I have to query the database for each record to check whether it exists; won't opening and closing the database connection so frequently slow things down? On the third point: each record I query is either filtered out, or put into an update list or an insert list, and then executed as a batch.
------ For reference only ---------------------------------------

Thanks, I'll consider it.
------ For reference only ---------------------------------------

Parsing the file and updating the database are not slow; the existence comparison is a bit slow, and the connection is opened and closed frequently during processing.
------ For reference only ---------------------------------------

The insert statements are already executed in batches. On the second point: I have to query the database for each record to check whether it exists; won't opening and closing the database connection so frequently slow things down? On the third point: each record I query is either filtered out, or put into an update list or an insert list, and then executed as a batch.

On the second point: if you need to open and close connections frequently, use a connection pool.
On the third point, by "cache" I mean: once a record has been queried, put it in the local cache; next time, check memory first, and if it's already there, don't go to the database at all.
------ For reference only ---------------------------------------

The insert statements are already executed in batches. On the second point: I have to query the database for each record to check whether it exists; won't opening and closing the database connection so frequently slow things down? On the third point: each record I query is either filtered out, or put into an update list or an insert list, and then executed as a batch.

On the second point: if you need to open and close connections frequently, use a connection pool.
On the third point, by "cache" I mean: once a record has been queried, put it in the local cache; next time, check memory first, and if it's already there, don't go to the database at all.

On the second point, I'll go study connection pools. On the third point, understood: once I've queried a record, I won't hit the database again for the same data. Thank you very much.
------ For reference only ---------------------------------------

Parsing the file and updating the database are not slow; the existence comparison is a bit slow, and the connection is opened and closed frequently during processing.
If the comparison is slow, consider MD5: concatenate all field values into one string, e.g. field1_field2, take the MD5 of that string, and store it in a dedicated, indexed column; then you can check existence by looking up that column alone. Connections are the easy part: whether you use a plain connection or a data source, one connection per file batch is fine.

The main slowness is indeed on the database side. Thank you very much for your answer.
------ For reference only ---------------------------------------
The reply was deleted by an administrator at 2014-05-24 08:45:01

------ For reference only ---------------------------------------

It can be done, but it isn't allowed.
------ For reference only ---------------------------------------

It can be done; it's just not allowed.
------ For reference only ---------------------------------------
What is the "same data" comparison based on?
------ For reference only ---------------------------------------
If you can't use a temporary table, this is going to hurt.
Checking and importing row by row in a loop versus importing 5,000 rows in one go: there should be no comparison.
------ For reference only ---------------------------------------
At first glance this looks like a full synchronization, but on closer inspection it isn't: the case of deleted data hasn't been considered.
------ For reference only ---------------------------------------

The fields compared for sameness are relatively few.
------ For reference only ---------------------------------------

The first synchronization is a full sync; from the second one on, it may only be the first few records.
------ For reference only ---------------------------------------
Thank you very much for the answers; I learned a lot. Thanks!
------ For reference only ---------------------------------------
Hash it, and index that column.
------ For reference only ---------------------------------------
Batch processing: I maintain a system exactly like this. Ours now handles 35 million rows of CSV data and has also hit a processing-speed bottleneck (it's an ERP system). Last month I did some simple optimization. My opinions on your questions: ① There's no need to check whether the data exists first (it wastes time). Do the update directly, and if the update affects 0 rows, do the insert instead; that way the existence check costs essentially nothing. For better performance still, consider MERGE. ② A temporary table is a good solution; I don't know why it isn't allowed in your case. We allow it, and it works very well.
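Point ① (update first, insert only when nothing was updated) can be sketched with JDBC, where `executeUpdate` returns the affected-row count. The table `t_record(code, val)` is invented for illustration, and the small `apply` helper replays the same decision against an in-memory map so the logic can run without a database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.HashMap;
import java.util.Map;

public class UpdateFirstUpsert {
    // UPDATE first; executeUpdate returns the number of affected rows,
    // and 0 means the key was absent, so fall back to INSERT.
    // Table and column names are hypothetical.
    static void upsert(Connection conn, String code, String val) throws Exception {
        try (PreparedStatement upd = conn.prepareStatement(
                "UPDATE t_record SET val = ? WHERE code = ?")) {
            upd.setString(1, val);
            upd.setString(2, code);
            if (upd.executeUpdate() == 0) { // nothing updated -> row is new
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO t_record (code, val) VALUES (?, ?)")) {
                    ins.setString(1, code);
                    ins.setString(2, val);
                    ins.executeUpdate();
                }
            }
        }
    }

    // Same decision against an in-memory map, for demonstration only:
    // Map.put returns null when the key was absent (insert case).
    static String apply(Map<String, String> table, String code, String val) {
        return table.put(code, val) == null ? "insert" : "update";
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        System.out.println(apply(table, "k1", "v1")); // prints insert
        System.out.println(apply(table, "k1", "v2")); // prints update
    }
}
```

One caveat: unlike the OP's scheme, this always writes even when the row is unchanged; combining it with an MD5 column lets the UPDATE's WHERE clause skip identical rows.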
