Fujitsu Laboratories has tackled the problem of conducting real-time analysis of huge volumes of data with the development of a high-speed data processing technology.
Social media is generating enormous volumes of data, and real-world time-series data from sensors and location information is continuously increasing. More than just storing this data, it is important that big data undergo a variety of analyses to quickly extract any valuable information. A typical example of big data use is recommendation analysis, which estimates a person's next action based on social-media data or purchasing history. The process of tracing connections between data elements in the flood of incoming messages, however, is hampered by the fact that the results and intermediate data of an ongoing analysis are too voluminous to be stored in memory.
To handle a volume of data that cannot fit in memory, a hard drive must be used as a storage device. A hard drive performs best when data is recorded continuously in large units, but if the unit is too large, performance declines and processing times lengthen. Conversely, when recording small units of data that arrives frequently, disk accesses multiply and performance diminishes. The ideal read/write unit is therefore dictated by the frequency with which data arrives, so different conditions result in different levels of efficiency and performance.
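A back-of-the-envelope calculation (not from the article, with an assumed per-access overhead) shows why small units are costly: each disk access pays a fixed seek penalty, so small write units multiply accesses while a larger unit amortizes that penalty across more data.

```python
# Illustrative only: the 10 ms figure is an assumed seek + rotational
# delay per disk access, not a number from Fujitsu's announcement.
ACCESS_OVERHEAD_MS = 10

def access_overhead_ms(total_kb, unit_kb):
    """Total per-access overhead to write total_kb in unit_kb chunks."""
    accesses = -(-total_kb // unit_kb)  # ceiling division
    return accesses * ACCESS_OVERHEAD_MS
```

Under this toy model, writing 1 GB in 4 KB units costs over 40 minutes of access overhead alone, versus about 10 seconds in 1 MB units.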
Fujitsu claims that its new technology increases overall system performance by creating a close link between the data-analysis software running on the server and the data-management software that handles the data-storage process, then varying the volume of data being processed at any one time in response to the frequency of processing requests from the data-analysis side. Even when there is a sharp increase in people accessing the server, high-speed analyses can still be performed.
Here is how it works:
When reading data, the data-management software reads not only the data requested by the data-analysis side but also other data laid out nearby on the hard drive. The data-analysis software then selects and uses the necessary parts of this data. Likewise, when writing data, the data-analysis software specifies data that is not necessary and passes it to the data-management side, which then writes the bulk of data it receives to locations on the disk that are as physically close together as possible.
Performing read/write operations in larger bulks reduces the number of disk accesses and increases the system's overall throughput.
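The article does not describe Fujitsu's implementation, but the bulk-read idea can be sketched as follows: instead of issuing one disk access per requested record, read the whole contiguous region containing it, and let the analysis layer pick out what it needs. The bulk and record sizes below are illustrative assumptions.

```python
BULK_SIZE = 1 << 20   # 1 MiB bulk read unit (illustrative value)
RECORD_SIZE = 128     # fixed record length, assumed for simplicity

def bulk_read(path, record_index):
    """Read the whole bulk unit containing record_index and return all
    records in it; the analysis side selects the ones it needs."""
    offset = record_index * RECORD_SIZE
    bulk_start = (offset // BULK_SIZE) * BULK_SIZE  # align to bulk boundary
    with open(path, "rb") as f:
        f.seek(bulk_start)
        chunk = f.read(BULK_SIZE)
    # Split the chunk into records; nearby records come along "for free",
    # so later requests for them need no additional disk access.
    return [chunk[i:i + RECORD_SIZE]
            for i in range(0, len(chunk), RECORD_SIZE)]
```

One seek now serves many records, which is where the throughput gain in larger bulks comes from.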
To process as much as possible at one time, the data-analysis software reads more data than is immediately needed, then selects the relevant parts for processing. Because the ideal size for a bulk read varies with conditions, the system monitors the volume of arriving data and the pace of analysis to decide the size for bulk reads and writes, making automatic adjustments for the best performance.
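A minimal sketch of such a tuning loop, assuming a simple doubling/halving policy (the thresholds and bounds are illustrative, not Fujitsu's actual parameters): grow the bulk unit while data arrives faster than it is consumed, and shrink it when requests are sparse so small reads are not delayed.

```python
MIN_BULK = 64 * 1024        # 64 KiB lower bound (assumed)
MAX_BULK = 8 * 1024 * 1024  # 8 MiB upper bound (assumed)

def adjust_bulk_size(current, arrival_rate, processing_rate):
    """Return the next bulk size given observed rates (records/sec)."""
    if arrival_rate > processing_rate:
        # Backlog is building: larger bulks cut per-record disk accesses.
        return min(current * 2, MAX_BULK)
    elif arrival_rate < 0.5 * processing_rate:
        # Load is light: smaller bulks keep individual reads prompt.
        return max(current // 2, MIN_BULK)
    return current  # rates roughly balanced: keep the current size
```

Called periodically with fresh measurements, this converges toward larger units under heavy load and smaller units when traffic is light.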
Fujitsu says that the new technology results in throughput five times faster than previously possible.
This technology could be used to distribute information to multiple users on a moving train based on the train's location, such as updates on nearby attractions, events of interest, or popular restaurants. In e-commerce, were a website to experience a sharp increase in the number of users accessing it before Christmas, it could still remain highly responsive. Performing big data analysis in real time opens up new potential applications and business uses.
Fujitsu plans to bring the technology into commercial use in fiscal 2014.